arxiv:2507.02092

Energy-Based Transformers are Scalable Learners and Thinkers

Published on Jul 2 · Submitted by amanchadha on Jul 4

Abstract

AI-generated summary: Energy-Based Transformers, trained via unsupervised learning, outperform existing models in both scaling and inference across text and image tasks by re-framing predictions as optimization problems.

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving up to a 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
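
The central mechanic here, treating prediction as energy minimization against a learned verifier, can be illustrated with a short sketch. This is a hedged illustration rather than the authors' released code: `energy_model` stands in for any network that maps an (input, candidate-prediction) pair to a scalar energy, and the function name, step count, and step size are assumptions chosen for readability.

```python
import torch

def predict_by_energy_minimization(energy_model, context, init_prediction,
                                   num_steps=10, step_size=0.1):
    """Refine a candidate prediction by gradient descent on its energy
    (lower energy = better input/prediction compatibility)."""
    prediction = init_prediction.clone().requires_grad_(True)
    for _ in range(num_steps):
        energy = energy_model(context, prediction)   # scalar or per-example energies
        grad, = torch.autograd.grad(energy.sum(), prediction)
        with torch.no_grad():
            prediction -= step_size * grad           # step toward lower energy
    return prediction.detach()
```

In this framing, the number of optimization steps becomes a test-time compute knob: spending more steps on a hard prediction is the "thinking longer" behavior discussed below.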

Community

Paper author · Paper submitter


Energy-Based Transformers (EBTs) generalize System 2 Thinking to arbitrary modalities and problem types using a scalable, unsupervised energy-based optimization framework that combines verification, uncertainty modeling, and dynamic compute allocation.

  • Unified System 2 Thinking via Energy-Based Optimization: EBTs treat inference as iterative energy minimization over a learned verifier function, enabling dynamic computation, uncertainty modeling, and explicit prediction verification across both discrete and continuous modalities, entirely from unsupervised pretraining.

  • Scalable Transformer-Based EBM Architecture: EBTs implement autoregressive (GPT-style) and bidirectional (BERT/DiT-style) Transformer variants, achieving superior pretraining scaling across parameters, depth, data, batch size, and FLOPs, surpassing the Transformer++ recipe; a minimal sketch of a scalar energy head on a Transformer follows this list.

  • Inference-Time Thinking via Gradient Descent and Best-of-N Sampling: EBTs support reasoning-like behavior at inference using two methods: taking more gradient descent steps ("thinking longer") and selecting the lowest-energy prediction from multiple candidates ("self-verification"); both yield significant gains, especially on out-of-distribution data, and both are sketched after this list.
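
As a rough illustration of the bidirectional variant mentioned above, the sketch below shows one way a Transformer can be wired to emit a single scalar energy for an (input, candidate) pair. The class name, pooling choice, and layer sizes are assumptions made for this example, not the paper's architecture code.

```python
import torch
import torch.nn as nn

class TinyEnergyTransformer(nn.Module):
    """Illustrative bidirectional energy head: jointly encode the context and the
    candidate prediction, pool, and map to one scalar energy per example."""
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.energy_head = nn.Linear(dim, 1)

    def forward(self, context_emb, candidate_emb):
        # context_emb: (batch, ctx_len, dim); candidate_emb: (batch, cand_len, dim)
        tokens = torch.cat([context_emb, candidate_emb], dim=1)
        hidden = self.encoder(tokens)
        return self.energy_head(hidden.mean(dim=1)).squeeze(-1)  # (batch,) energies
```

An autoregressive variant would instead use causal attention; in either case, the scalar energy output is what the gradient-based refinement differentiates through.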
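
The two inference-time knobs from the last bullet can be combined as sketched below. This reuses the hypothetical `energy_model` and `predict_by_energy_minimization` from the earlier sketches; the candidate set and hyperparameters are again placeholders, not the paper's exact procedure.

```python
import torch

def think_best_of_n(energy_model, context, init_candidates,
                    num_steps=20, step_size=0.1):
    """Refine each of N candidates ("thinking longer"), then keep the one the
    learned verifier scores lowest ("self-verification")."""
    refined, energies = [], []
    for cand in init_candidates:
        pred = predict_by_energy_minimization(energy_model, context, cand,
                                              num_steps=num_steps, step_size=step_size)
        refined.append(pred)
        with torch.no_grad():
            energies.append(energy_model(context, pred).sum().item())
    best = min(range(len(energies)), key=lambda i: energies[i])
    return refined[best]
```

Because both knobs only touch inference, neither requires supervision beyond the unsupervised pretraining of the energy function, which is the point the bullet above makes.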
