arxiv:2503.17361

Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation

Published on Mar 21
· Submitted by pranamanam on Mar 26

Abstract

Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.
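As a rough illustration of the interpolant idea in the abstract, the sketch below softens a one-hot target with Gumbel noise and a temperature that decays over time, so the sample moves from a smooth categorical distribution at t = 0 toward a single vertex at t = 1. The exponential schedule, the parameters `tau_max` and `decay`, the logit form (one-hot plus noise), and all function names are assumptions made for this example, not the paper's exact definitions.

```python
import math
import random


def sample_gumbel(k, rng):
    # Standard Gumbel(0, 1) noise via inverse transform sampling.
    return [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in range(k)]


def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature(t, tau_max=10.0, decay=5.0):
    # Hypothetical schedule (an assumption): hot at t = 0, cold at t = 1.
    return tau_max * math.exp(-decay * t)


def gumbel_softmax_interpolant(target_index, k, t, noise):
    # Assumed logit form: one-hot target plus fixed Gumbel noise, divided by tau(t).
    tau = temperature(t)
    logits = [((1.0 if i == target_index else 0.0) + noise[i]) / tau for i in range(k)]
    return softmax(logits)


if __name__ == "__main__":
    rng = random.Random(0)
    k = 20  # e.g. the 20 canonical amino acids
    noise = sample_gumbel(k, rng)
    for t in (0.0, 0.5, 1.0):
        x_t = gumbel_softmax_interpolant(3, k, t, noise)
        print(f"t={t:.1f}  largest coordinate={max(x_t):.3f}")
```

With the noise held fixed, the largest coordinate of the sample never decreases as t grows, which is the "smooth to concentrated" behavior the abstract describes. Note that in this toy version the vertex reached at t = 1 is the argmax of the noisy logits, which need not be the target index.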

Community

Paper author Paper submitter

Flow matching in discrete spaces is powerful, but existing approaches are limited by high variance (Dirichlet FM), potential overfitting (Fisher FM), and rigidity. We address these limitations with a Gumbel-Softmax interpolant whose time-dependent temperature creates smooth, learnable flows across the simplex! 🔥

Our model, Gumbel-Softmax FM, smoothly transforms noise into clean sequences—avoiding hard discretization while learning better transport paths. We also introduce Gumbel-Softmax Score Matching, learning the score function over the simplex for stochastic sampling! 🎯

But generative control is hard post-training. 😓 So we introduce STGFlow — a training-free guidance method using straight-through gradients from pre-trained classifiers to steer flow trajectories at inference. No need to retrain: plug-and-play guidance for peptides, DNA, and proteins!!🧠
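To make the straight-through idea concrete, here is a minimal sketch of one guided Euler step on the simplex. The forward pass discretizes the soft point with an argmax so the classifier only sees a clean token, and the backward pass reuses that gradient as if the argmax were the identity. The toy sigmoid classifier, its weight vector `w`, the `guidance_scale` parameter, and the centering and renormalization steps are all assumptions chosen for illustration; this is not the paper's exact update rule.

```python
import math


def one_hot_argmax(x):
    # Forward pass of the straight-through estimator: hard discretization.
    j = max(range(len(x)), key=lambda i: x[i])
    return [1.0 if i == j else 0.0 for i in range(len(x))]


def classifier_and_grad(x_hard, w):
    # Toy classifier sigmoid(w . x), a hypothetical stand-in for a classifier
    # pre-trained on clean sequences. Returns the score and its input gradient.
    dot = sum(wi * xi for wi, xi in zip(w, x_hard))
    s = 1.0 / (1.0 + math.exp(-dot))
    return s, [s * (1.0 - s) * wi for wi in w]


def straight_through_guided_step(x_soft, velocity, w, dt=0.1, guidance_scale=1.0):
    # The classifier is evaluated on the discretized point only.
    x_hard = one_hot_argmax(x_soft)
    _, grad = classifier_and_grad(x_hard, w)
    # Straight-through: treat d(x_hard)/d(x_soft) as the identity, so the
    # gradient at x_hard is used directly as a gradient with respect to x_soft.
    # Centering (an assumption) keeps the update tangent to the simplex.
    mean_g = sum(grad) / len(grad)
    guided = [v + guidance_scale * (g - mean_g) for v, g in zip(velocity, grad)]
    # Euler step, then clip and renormalize to stay on the simplex.
    x_new = [max(xi + dt * vi, 0.0) for xi, vi in zip(x_soft, guided)]
    total = sum(x_new) or 1.0
    return [v / total for v in x_new]


if __name__ == "__main__":
    x = [0.4, 0.3, 0.2, 0.1]
    v = [0.0, 0.0, 0.0, 0.0]   # placeholder for the unconditional velocity
    w = [0.0, 0.0, 2.0, 0.0]   # toy classifier that favors class 2
    print(straight_through_guided_step(x, v, w))
```

In this example the unconditional velocity is zero, so any movement comes from guidance alone: probability mass shifts toward the class the toy classifier favors, without retraining anything.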

Across bioengineering tasks—DNA promoter design, de novo protein generation, and peptide binder design—our framework outperforms autoregressive and discrete diffusion baselines in fidelity, foldability, and functional control.📈🧬

We even show de novo peptide binders to targets with no known binders, including proteins involved in rare pediatric leukodystrophies and neurodegenerative diseases.💊 Our binders further show better docking scores and ipTM values than known binders across 13 targets and scramble controls! 🙌

