ClinOracle

Contents

Code

  • representations/ - scripts to train models and generate transcriptome effect representations on Tahoe-100M
  • clinical_data_curation/ - scripts to curate clinical trial data
  • approval_prediction_benchmark.ipynb - benchmark on clinical approval prediction
  • classifier.py - Benchmarking classifier implementation

Data

  • clinical_evidence_data/ - Curated clinical evidence data on Tahoe drugs
  • data_for_classifier/ - input data for benchmarks
  • data/ - misc processed data

Team Members

Project

Pharmacotranscriptomic representations to predict clinical trial success

Slides

Overview

Large in vitro perturbation screens like Tahoe-100M make it possible to test whether transcriptional responses predict measures of clinical success such as drug approval.

Motivation

Despite rigorous research efforts, clinical success and drug approval remain difficult to predict in early drug development.

Methods

Clinical trial information

We used LLMs to collect clinical trial and adverse effects data associated with the chemical agents screened in Tahoe-100M, and annotated which drugs were tested or reached approval for a condition affecting one of the screened organs.

Transcriptome effects representations

  • E-distance: overall transcriptional shift from DMSO for each drug in each cell line. We selected the dose with the maximum e-distance for each drug-cell line pair.
  • LDVAE: VAE with a linear decoder for gene program interpretability (trained on plates 1-4, with embeddings generated for the full dataset)
  • MrVI: sample-aware VAE representation. Using the pseudobulked Tahoe-100M data, we trained a MrVI model with each sample defined as cell_drug and the union of highly variable genes identified within each cell line as features. We generated two latent embeddings, the 10-dimensional u-space and the 30-dimensional z-space, which were used as input to the classifier.
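
As a rough illustration of the e-distance representation, the sketch below computes an energy distance between treated and DMSO control cells and keeps the dose with the largest value. It is a minimal sketch, not the code in representations/: the function names, the use of plain Euclidean distance, and the assumption that cells are already embedded as rows of a NumPy array (for example in a PCA space) are all hypothetical choices made for this example.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (row of a, row of b)."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()


def energy_distance(treated, control):
    """Energy distance: 2 * between-group minus the two within-group terms.

    Assumption: Euclidean distance on the given embedding; other
    implementations may use squared distances or a different estimator.
    """
    between = mean_pairwise_distance(treated, control)
    within_treated = mean_pairwise_distance(treated, treated)
    within_control = mean_pairwise_distance(control, control)
    return 2.0 * between - within_treated - within_control


def select_max_edistance_dose(treated_by_dose, control):
    """Return (dose, e-distance) for the dose farthest from control.

    treated_by_dose is a hypothetical dict mapping dose -> cell embedding
    array for one drug in one cell line.
    """
    scores = {
        dose: energy_distance(cells, control)
        for dose, cells in treated_by_dose.items()
    }
    best_dose = max(scores, key=scores.get)
    return best_dose, scores[best_dose]


# Synthetic example: the higher dose is shifted further from DMSO.
rng = np.random.default_rng(0)
dmso = rng.normal(size=(50, 5))
doses = {
    0.05: rng.normal(loc=0.2, size=(50, 5)),
    5.0: rng.normal(loc=2.0, size=(50, 5)),
}
best_dose, best_score = select_max_edistance_dose(doses, dmso)
```

Applied per drug-cell line pair, this yields one scalar per pair, which is the form of input the e-distance representation provides to the classifier.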

Benchmark set-up

We use logistic regression on the transcriptome-effect representations to predict whether a drug was approved for a tissue of interest, splitting drugs into train and test sets and evaluating the precision-recall curve for the test drugs. We treat the per-organ approval rate as a technical confounder that must be accounted for.
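
The set-up above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the implementation in classifier.py: it assumes one row per (drug, organ) pair, splits by drug with GroupShuffleSplit so that no drug appears in both sets, summarizes the precision-recall curve with average precision, and scores a baseline that predicts the training-set approval rate of each organ. The function name, array names, split fraction, and synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GroupShuffleSplit


def run_benchmark(features, approved, drug_ids, organ_ids, seed=0):
    """Fit logistic regression on train drugs, score held-out drugs."""
    # Split by drug so every row of a given drug lands on one side only.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=seed)
    train_idx, test_idx = next(
        splitter.split(features, approved, groups=drug_ids)
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(features[train_idx], approved[train_idx])
    model_scores = model.predict_proba(features[test_idx])[:, 1]

    # Baseline: per-organ approval rate estimated on the training drugs,
    # falling back to the overall training rate for unseen organs.
    overall_rate = approved[train_idx].mean()
    organ_rate = {
        organ: approved[train_idx][organ_ids[train_idx] == organ].mean()
        for organ in np.unique(organ_ids[train_idx])
    }
    baseline_scores = np.array(
        [organ_rate.get(organ, overall_rate) for organ in organ_ids[test_idx]]
    )

    return {
        "model_ap": average_precision_score(approved[test_idx], model_scores),
        "baseline_ap": average_precision_score(
            approved[test_idx], baseline_scores
        ),
        "train_drugs": set(drug_ids[train_idx].tolist()),
        "test_drugs": set(drug_ids[test_idx].tolist()),
    }


# Synthetic data: 60 drugs x 3 organs with organ-specific approval rates.
rng = np.random.default_rng(0)
organs = np.array(["lung", "liver", "colon"])
base_rate = {"lung": 0.2, "liver": 0.35, "colon": 0.5}
drug_ids = np.repeat(np.arange(60), len(organs))
organ_ids = np.tile(organs, 60)
approved = np.array(
    [int(rng.random() < base_rate[organ]) for organ in organ_ids]
)
features = rng.normal(size=(len(drug_ids), 10))

results = run_benchmark(features, approved, drug_ids, organ_ids)
```

Comparing model_ap against baseline_ap is one way to check whether a representation adds signal beyond the per-organ approval rate; on the random features used here no such gain is expected.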

Results

None of the unsupervised multi-dimensional representations outperformed the approval rate baseline. In contrast, e-distance was consistently negatively associated with approval for conditions affecting the target tissue.

Discussion and Future Work

With the concept established, we propose expanding the benchmark by testing additional representations of the data, including MrVI single-cell sample-sample distances, differential gene expression or program expression, and cell counts. The framework is set up to test additional and more advanced prediction targets, such as clinical trial phase success and adverse effect (AE) rate or severity.


Dataset used to train emdann/clin-oracle-tahoe-deepdive