ClinOracle

Code

representations/ - scripts to train models and generate transcriptome effect representations on Tahoe-100M
clinical_data_curation/ - scripts to curate clinical trial data
approval_prediction_benchmark.ipynb - benchmark on clinical approval prediction
classifier.py - Benchmarking classifier implementation

Data

clinical_evidence_data/ - Curated clinical evidence data on Tahoe drugs
data_for_classifier/ - input data for benchmarks
data/ - misc processed data

Team Members

Emma Dann - Stanford University & Gladstone Institutes - [email protected]
Tony Zeng - Stanford University - [email protected]
Ross Giglio - Columbia University - [email protected]
Kevin Hoffer-Hawlik - Columbia University - [email protected]
Meer Mustafa - BigHat Biosciences - [email protected]

Project

Pharmacotranscriptomic representations to predict clinical trial success

Slides

Overview

Large in vitro perturbation screens like Tahoe-100M allow for assessing whether transcriptional responses are predictive of metrics of clinical success like drug approval.

Motivation

Despite rigorous research efforts, clinical success and drug approval remain difficult to predict in early drug development.

Methods

Clinical trial information

We used LLMs to collect clinical trial and adverse effects data associated with the chemical agents screened in Tahoe-100M, and annotated which drugs were tested or reached approval for a condition affecting one of the screened organs.

Transcriptome effects representations

E-distance: overall transcriptional shift from DMSO for each drug in each cell line. We selected the dose with the maximum e-distance for each drug-cell line pair.
LDVAE: VAE with a linear decoder for gene program interpretability (trained on plates 1-4, with embeddings generated for the full dataset).
MrVI: sample-aware VAE representation. Using the pseudobulked Tahoe-100M data, we trained a MrVI model with sample defined as cell_drug and the union of highly variable genes within each cell line as features. We generated two latent embeddings, the 10-dimensional u-space and the 30-dimensional z-space, which were used as input to the classifier.
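
As an illustration of the e-distance representation and the max-dose selection described above, here is a minimal sketch on synthetic data. It assumes the energy distance is computed with plain Euclidean distances and the simple all-pairs estimator; the scripts in representations/ may use a different metric, estimator, or feature space. The function names (`e_distance`, `select_max_dose`), the dose labels, and the simulated data are all hypothetical.

```python
import numpy as np


def pairwise_mean_distance(a, b):
    """Mean Euclidean distance over all (row of a, row of b) pairs."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()


def e_distance(treated, control):
    """Energy distance: 2 * between-group term minus the two within-group terms."""
    delta = pairwise_mean_distance(treated, control)
    sigma_treated = pairwise_mean_distance(treated, treated)
    sigma_control = pairwise_mean_distance(control, control)
    return 2.0 * delta - sigma_treated - sigma_control


def select_max_dose(cells_by_dose, control):
    """Return the dose whose cells are farthest from control, and its score."""
    scores = {dose: e_distance(cells, control) for dose, cells in cells_by_dose.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Synthetic stand-in for one drug in one cell line (assumption: 10 features,
# e.g. principal components; three made-up dose labels).
rng = np.random.default_rng(0)
dmso = rng.normal(0.0, 1.0, size=(200, 10))
doses = {
    0.05: rng.normal(0.1, 1.0, size=(200, 10)),
    0.5: rng.normal(0.5, 1.0, size=(200, 10)),
    5.0: rng.normal(2.0, 1.0, size=(200, 10)),
}

best_dose, best_score = select_max_dose(doses, dmso)
print(f"selected dose: {best_dose}, e-distance: {best_score:.3f}")
```

The all-pairs computation is quadratic in the number of cells, so on real data one would typically subsample cells or work in a reduced feature space.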

Benchmark set-up

We use logistic regression on the transcriptome-effect representations to predict whether a drug was approved for a tissue of interest, splitting drugs into train and test sets and evaluating the precision-recall curve on the test drugs. We treat the rate of approvals per organ as a technical confounder that must be accounted for.
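
The set-up above can be sketched with scikit-learn on synthetic data. This is not the code in classifier.py: the data are simulated, the variable names are hypothetical, and adding the per-organ approval rate as a covariate is only one possible way of accounting for the confounder, assumed here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data: one row per (drug, organ) pair.
n_drugs, n_organs, n_dims = 120, 4, 10
drug_id = np.repeat(np.arange(n_drugs), n_organs)
organ_id = np.tile(np.arange(n_organs), n_drugs)
X = rng.normal(size=(n_drugs * n_organs, n_dims))  # stand-in representation
organ_rate = np.array([0.05, 0.10, 0.20, 0.30])    # made-up approval rates
y = rng.binomial(1, organ_rate[organ_id])          # 1 = approved for that organ

# Split by drug so that no drug appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=drug_id))

# Baseline: per-organ approval rate estimated on the training drugs only.
train_rate = np.array(
    [y[train_idx][organ_id[train_idx] == o].mean() for o in range(n_organs)]
)
baseline_scores = train_rate[organ_id[test_idx]]

# Model: representation plus the organ approval rate as an extra covariate
# (an assumed way to account for the confounder).
features = np.column_stack([X, train_rate[organ_id]])
clf = LogisticRegression(max_iter=1000)
clf.fit(features[train_idx], y[train_idx])
model_scores = clf.predict_proba(features[test_idx])[:, 1]

# Average precision summarizes the precision-recall curve on the test drugs.
ap_baseline = average_precision_score(y[test_idx], baseline_scores)
ap_model = average_precision_score(y[test_idx], model_scores)
print(f"baseline AP: {ap_baseline:.3f}, model AP: {ap_model:.3f}")
```

Because the synthetic features carry no signal, the model is not expected to beat the baseline here; the point is the comparison itself, where a representation is only informative if it improves on the per-organ approval rate.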

Results

None of the unsupervised multi-dimensional representations outperformed the approval rate baseline, while we found that e-distance is consistently negatively associated with approval for conditions affecting the target tissue.

Discussion and Future Work

With the concept established, we propose testing additional representations of the data, including MrVI single-cell sample-sample distances, differential gene expression or program expression, and cell counts. The framework is set up to test additional, more advanced prediction targets such as clinical trial phase success and adverse effect (AE) rate or severity.
