MolCRAFT Series for Drug Design: MolJO

Welcome to the official repository for the MolCRAFT series of projects! This series focuses on developing and improving deep learning models for structure-based drug design (SBDD) and structure-based molecule optimization (SBMO). Our goal is to generate molecules with high binding affinity and plausible 3D conformations.

This repository contains the source code for the following projects:

MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space (ICML'24)
MolJO: Empower Structure-Based Molecule Optimization with Gradient Guided Bayesian Flow Networks (ICML'25)
MolPilot: Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule (ICML'25)

📜 Overview

The MolCRAFT series addresses critical challenges in generative models for SBDD, including modeling molecular geometries, handling hybrid continuous-discrete spaces, and optimizing molecules against protein targets. Each project introduces novel methodologies and achieves state-of-the-art performance on relevant benchmarks.

🧭 Navigation

Folder	TL;DR	Description
MolCRAFT	Unified Space for Molecule Generation	MolCRAFT is the first SBDD generative model based on Bayesian Flow Networks (BFN). It operates in a unified continuous parameter space across modalities and uses a variance-reduced sampling strategy to generate high-quality samples with more than 10x speedup.
MolJO	Gradient-Guided Molecule Optimization	MolJO is a gradient-based Structure-Based Molecule Optimization (SBMO) framework derived within BFN. It employs joint guidance across continuous coordinates and discrete atom types, alongside a backward correction strategy for effective optimization.
MolPilot	Optimal Scheduling	MolPilot enhances SBDD by introducing a VLB-Optimal Scheduling (VOS) strategy for the twisted multimodal probability paths. It significantly improves molecular geometries and interaction modeling, achieving a 95.9% PB-Valid rate.

🚀 MolJO

Official implementation of ICML 2025 "Empower Structure-Based Molecule Optimization with Gradient Guided Bayesian Flow Networks".

Environment

We recommend installing via Docker if a Linux server with an NVIDIA GPU is available.

Otherwise, see README for env for further details on the Docker or conda setup.

Prerequisite

Docker with nvidia-container-runtime enabled is required on your Linux system.

This repo provides an easy-to-use script to install Docker and nvidia-container-runtime: in ./docker, run sudo ./setup_docker_for_host.sh to set up your host machine.

For details, please refer to the install guide.

Install via Docker

We highly recommend setting up the environment via Docker, since all it takes is a single make command:

cd ./docker
make

Data

We use the same CrossDocked dataset as previous approaches, with affinity information (Vina Score). The data used for training and evaluating the model is obtained from KGDiff and should be placed in the data folder.

To train the property predictor from scratch, extract the following files from data.zip on Zenodo:

crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
crossdocked_pocket10_pose_split.pt

To evaluate the model on the test set, download test_set.zip and unzip it into the data folder. It includes the original PDB files used for Vina docking.
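Before launching training, it can help to confirm that the two extracted files are where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_data_files` is hypothetical, and it assumes the files sit directly inside the data folder, as described above.

```python
from pathlib import Path

# File names copied from the list above; the "data" folder name follows this README.
REQUIRED_FILES = [
    "crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb",
    "crossdocked_pocket10_pose_split.pt",
]


def missing_data_files(data_dir="data", required=REQUIRED_FILES):
    """Return the required names that are not present in data_dir (hypothetical helper)."""
    root = Path(data_dir)
    # exists() is true for both files and directories, so either layout passes.
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```

Run it from the repository root so that the relative data path resolves correctly.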

Training

python train_classifier.py --exp_name ${EXP_NAME} --revision ${REVISION} --prop_name ${PROPERTY} # affinity qed sa

The remaining arguments should keep the default values shown below:

python train_bfn.py --config_file configs/train_prop.yaml --sigma1_coord 0.03 --beta1 1.5 --lr 5e-4 --time_emb_dim 1 --epochs 15 --max_grad_norm Q --destination_prediction True --use_discrete_t True

Sampling

We provide the pretrained checkpoints for property predictors (Vina Score, SA) in the pretrained Google Drive folder. The backbone checkpoint can be found here. After downloading them, please put the checkpoints under the pretrained folder.

Sampling for pockets in the test set

python sample_guided.py --num_samples ${NUM_MOLS_PER_POCKET} --objective ${OBJ} # vina_sa

The remaining arguments should keep the default values shown below:

python sample_guided.py --config_file configs/test_opt.yaml --pos_grad_weight 50 --type_grad_weight 50 --guide_mode param_naive --sample_steps 200 --sample_num_atoms prior
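When scripting several sampling runs, the two commands above can be merged into a single invocation. The sketch below only assembles and prints the command line. The flag names and default values are copied from this README, while the helper `build_sampling_command` and the example value of 10 molecules per pocket are illustrative assumptions, not part of the repository.

```python
import shlex

# Defaults copied from the command above.
DEFAULT_FLAGS = {
    "--config_file": "configs/test_opt.yaml",
    "--pos_grad_weight": "50",
    "--type_grad_weight": "50",
    "--guide_mode": "param_naive",
    "--sample_steps": "200",
    "--sample_num_atoms": "prior",
}


def build_sampling_command(num_samples, objective, overrides=None):
    """Return the argument list for sample_guided.py (hypothetical helper)."""
    flags = dict(DEFAULT_FLAGS)      # copy so the defaults stay untouched
    flags.update(overrides or {})    # caller-supplied values win
    cmd = ["python", "sample_guided.py",
           "--num_samples", str(num_samples),
           "--objective", objective]
    for name, value in flags.items():
        cmd += [name, str(value)]
    return cmd


if __name__ == "__main__":
    # 10 molecules per pocket is an arbitrary placeholder value.
    print(shlex.join(build_sampling_command(10, "vina_sa")))
```

The returned list can be passed to `subprocess.run` if you want to launch the run from Python rather than copy the printed line into a shell.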

Sampling from a PDB file

To sample from a whole-protein PDB file, a corresponding reference ligand is needed to clip the protein pocket (a 10 Å region around the reference position).

python sample_for_pocket_guided.py --protein_path ${PDB_PATH} --ligand_path ${SDF_PATH}
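To illustrate what clipping a pocket means, here is a minimal sketch that keeps protein atoms lying within 10 Å of any reference ligand atom. It is an assumption-laden illustration rather than the repository's implementation: the function `clip_pocket` and the dictionary-based atom representation are hypothetical, and the actual script may select whole residues or measure distance differently.

```python
import math


def clip_pocket(protein_atoms, ligand_coords, radius=10.0):
    """Keep protein atoms within `radius` angstroms of any ligand atom.

    protein_atoms: list of dicts with an "xyz" tuple (hypothetical format).
    ligand_coords: list of (x, y, z) tuples for the reference ligand.
    """
    return [
        atom for atom in protein_atoms
        if any(math.dist(atom["xyz"], lig) <= radius for lig in ligand_coords)
    ]


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    protein = [
        {"name": "CA", "xyz": (0.0, 0.0, 0.0)},
        {"name": "CB", "xyz": (9.0, 0.0, 0.0)},
        {"name": "N", "xyz": (25.0, 0.0, 0.0)},
    ]
    ligand = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    pocket = clip_pocket(protein, ligand)
    print([atom["name"] for atom in pocket])
```

In the toy input, the third atom is 23 Å from the nearest ligand atom and is therefore dropped, while the first two fall inside the 10 Å cutoff.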

Evaluation

Evaluating meta files

We provide our samples as moljo_vina_sa_vina_docked_pose_checked.pt in the sample Google Drive folder.

Citation

@article{qiu2025empower,
  title={Empower Structure-Based Molecule Optimization with Gradient Guided Bayesian Flow Networks},
  author={Qiu, Keyue and Song, Yuxuan and Yu, Jie and Ma, Hongbo and Cao, Ziyao and Zhang, Zhilong and Wu, Yushuai and Zheng, Mingyue and Zhou, Hao and Ma, Wei-Ying},
  journal={ICML 2025},
  year={2025}
}