DAWO: Drug-Aware and Cell-line-Aware Variational Autoencoder

Team Name

DAWO

Members

Yuhan Hao
Sheng-Yong Niu
Jaanak Prashar
Tiange (Alex) Cui
Danila Bredikhin
Mikaela Koutrouli

Project

Title

DAWO: Drug-Aware and Cell-line-Aware Variational Autoencoder for Drug Response Prediction

Demo - if Dawo was a startup 👀

https://www.loom.com/share/23bb756458684b2eb416265049068331?sid=c3d6fab1-9041-4309-91bb-1a36a2fa3e16

Overview

DAWO is a specialized Variational Autoencoder (VAE) designed to predict drug responses in cancer cell lines by integrating gene expression data with drug and cell line features. The model leverages multi-modal representation learning to capture complex interactions between drugs and cells, enabling more accurate prediction of drug responses across diverse conditions.

Motivation

Understanding and predicting how cancer cells respond to different therapeutic compounds is crucial for advancing precision medicine approaches in oncology. Traditional methods often fail to capture the complex relationships between drugs, cell lines, and their molecular profiles. DAWO addresses this challenge by combining a VAE architecture with drug-aware and cell-line-aware components to model these interactions effectively.

Methods

DAWO incorporates a multi-modal architecture with the following key components:

Gene Expression Encoder: Processes normalized gene expression data from cancer cell lines (input dimension: 5000)
Drug Feature Encoder: Processes drug features (total input dimension: 3122) combining:

Drug summary embeddings
ChemBERTa molecular structure embeddings
Semantic feature embeddings

Cell Line Feature Encoder: Processes cell line features focusing on driver gene mutations and other genomic characteristics (input dimension: 113)
Latent Space: A 50-dimensional latent representation combining drug, cell line, and gene expression information
Decoder: Reconstructs gene expression profiles from the latent representation
Classifier: Predicts drug response categories from the latent representation (379 classes)
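
As a rough illustration of how the components listed above could fit together, the following NumPy sketch runs one forward pass with randomly initialised weights. Only the input, latent, and output dimensions come from this document. The hidden width, the single-layer encoders, the fusion by concatenation, and the `logvar` and `logits` names are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions taken from the component list above.
GENE_DIM, DRUG_DIM, CELL_DIM = 5000, 3122, 113
LATENT_DIM, N_CLASSES = 50, 379
HIDDEN = 128  # hypothetical hidden width, not stated in the source


def linear(in_dim, out_dim):
    """Return randomly initialised (weights, bias) for a dense layer."""
    return rng.normal(0.0, 0.01, size=(in_dim, out_dim)), np.zeros(out_dim)


def dense(x, params, relu=True):
    """Apply a dense layer, optionally followed by ReLU."""
    w, b = params
    out = x @ w + b
    return np.maximum(out, 0.0) if relu else out


# One encoder per modality (assumed single-layer encoders).
enc_gene = linear(GENE_DIM, HIDDEN)
enc_drug = linear(DRUG_DIM, HIDDEN)
enc_cell = linear(CELL_DIM, HIDDEN)
# Heads mapping the fused hidden state to the latent mean and log-variance.
head_mu = linear(3 * HIDDEN, LATENT_DIM)
head_logvar = linear(3 * HIDDEN, LATENT_DIM)
# Decoder and classifier both read from the latent vector.
dec = linear(LATENT_DIM, GENE_DIM)
clf = linear(LATENT_DIM, N_CLASSES)


def forward(x, drug, cell):
    """Run one forward pass and return a dict of outputs."""
    # Assumed fusion strategy: concatenate the three encoded modalities.
    h = np.concatenate(
        [dense(x, enc_gene), dense(drug, enc_drug), dense(cell, enc_cell)],
        axis=1,
    )
    mu = dense(h, head_mu, relu=False)
    logvar = dense(h, head_logvar, relu=False)
    # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    x_hat = dense(z, dec, relu=False)
    logits = dense(z, clf, relu=False)
    return {"x_hat": x_hat, "mu": mu, "logvar": logvar, "logits": logits}


batch = 4
out = forward(
    rng.standard_normal((batch, GENE_DIM)),
    rng.standard_normal((batch, DRUG_DIM)),
    rng.standard_normal((batch, CELL_DIM)),
)
print({name: value.shape for name, value in out.items()})
```

The sketch is only meant to make the data flow and tensor shapes concrete; a trained model would use learned weights and most likely a deep learning framework.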

The model was trained using a combined loss function that balances reconstruction accuracy, latent space regularization, and classification performance.
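
The exact loss terms and weights are not given here. A minimal sketch of one common formulation, assuming a mean-squared-error reconstruction term, the closed-form Gaussian KL divergence, and a cross-entropy classification term, could look like the following. The function name and the weights `beta` and `gamma` are hypothetical.

```python
import numpy as np


def vae_classifier_loss(x, x_hat, mu, logvar, logits, labels, beta=1.0, gamma=1.0):
    """Combine reconstruction, KL, and classification terms (batch means)."""
    # Reconstruction: squared error summed over genes, averaged over the batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # KL divergence between N(mu, sigma^2) and N(0, I), in closed form.
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    # Cross-entropy via a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    total = recon + beta * kl + gamma * ce
    return total, {"recon": recon, "kl": kl, "ce": ce}


# Sanity check: perfect reconstruction, a standard-normal posterior, and
# uniform class scores leave only the cross-entropy of a uniform guess.
x = np.zeros((2, 5000))
total, parts = vae_classifier_loss(
    x, x, np.zeros((2, 50)), np.zeros((2, 50)), np.zeros((2, 379)), np.array([0, 5])
)
print(parts)
```

In this sanity check the reconstruction and KL terms are zero, so the total equals log(379), the cross-entropy of a uniform prediction over 379 classes. Raising `beta` or `gamma` shifts the balance toward latent regularization or classification, respectively.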

Results

DAWO demonstrates strong performance in predicting drug responses across multiple cancer cell lines, with particular strength in:

Distinguishing between responsive and non-responsive cell lines for specific drugs
Generalizing to new drug-cell line combinations not seen during training
Capturing meaningful biological signals in the latent space that reflect known drug mechanisms and cellular pathways

Discussion

Our model provides a powerful framework for drug response prediction that could accelerate drug discovery and repurposing efforts. The integration of multi-modal data (gene expression, drug features, cell line characteristics) enables DAWO to capture complex interaction patterns that simpler models miss.

Limitations include the need for comprehensive feature sets for new drugs and cell lines, and potential biases from the training data distribution. Future work will focus on incorporating additional molecular modalities and expanding the training data to improve generalization across diverse drug classes.

Model Description

Using a variational autoencoder (VAE) approach, DAWO learns latent representations of gene expression, drug features, and cell line features, and combines them to predict drug responses and identify potential drug-cell line interactions.

Model Inputs and Outputs

Inputs:

Gene Expression Data: Normalized gene expression profiles (shape: [batch_size, 5000])
Drug Features: Combined drug embeddings (total shape: [batch_size, 3122]) including:
Drug summary embeddings
ChemBERTa molecular structure embeddings
Semantic feature embeddings
Cell Line Features: Cell line genomic profiles (shape: [batch_size, 113])
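
The inputs listed above can be assembled and checked before they reach the model. The sketch below is one possible way to do this with NumPy arrays. The helper names are hypothetical, and the widths of the three individual drug embeddings (1536, 768, 818) are illustrative values chosen only so that they sum to 3122; the actual split is not stated in this document.

```python
import numpy as np

GENE_DIM, DRUG_DIM, CELL_DIM = 5000, 3122, 113  # widths listed above


def build_drug_features(summary_emb, chemberta_emb, semantic_emb):
    """Concatenate the three drug embeddings along the feature axis."""
    drug = np.concatenate([summary_emb, chemberta_emb, semantic_emb], axis=1)
    if drug.shape[1] != DRUG_DIM:
        raise ValueError(f"expected {DRUG_DIM} drug features, got {drug.shape[1]}")
    return drug


def check_inputs(gene_expression, drug_features, cell_features):
    """Raise ValueError unless every input is [batch, expected_width]."""
    arrays = {
        "gene_expression": (gene_expression, GENE_DIM),
        "drug_features": (drug_features, DRUG_DIM),
        "cell_features": (cell_features, CELL_DIM),
    }
    batch = gene_expression.shape[0]
    for name, (arr, width) in arrays.items():
        if arr.shape != (batch, width):
            raise ValueError(
                f"{name}: expected shape {(batch, width)}, got {arr.shape}"
            )
    return batch


rng = np.random.default_rng(0)
# Illustrative embedding widths only; they are not taken from the source.
drug = build_drug_features(
    rng.standard_normal((2, 1536)),
    rng.standard_normal((2, 768)),
    rng.standard_normal((2, 818)),
)
gene = rng.standard_normal((2, GENE_DIM))
cell = rng.standard_normal((2, CELL_DIM))
print(check_inputs(gene, drug, cell), drug.shape)
```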

Outputs:

Reconstructed Gene Expression: Reconstructed expression profiles (shape: [batch_size, 5000])
Latent Representation: Compressed representation in latent space (shape: [batch_size, 50])
Drug Response Predictions: Predicted response classes (shape: [batch_size, 379])
Response Probabilities: Softmax probabilities for each response class (shape: [batch_size, 379])
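
Both classification outputs have one column per response class. Assuming the predictions are per-class scores and the probabilities are their softmax (the usual relationship, though not confirmed here), the most likely classes for each sample can be recovered as in this NumPy sketch. The helper names are hypothetical and the scores are random stand-ins.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def top_k_classes(probs, k=3):
    """Return the indices of the k most probable classes per row, best first."""
    return np.argsort(-probs, axis=1)[:, :k]


rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 379))  # stand-in for per-class scores
probs = softmax(scores)                 # each row sums to 1
predicted = probs.argmax(axis=1)        # single most likely class per sample
top3 = top_k_classes(probs, k=3)
print(predicted, top3)
```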

How to Use

import torch  # assumed: the shape comments below describe the inputs as tensors

from dawo_wrapper import DAWOWrapper

# Initialize model
model = DAWOWrapper(repo_path="path/to/model")

# Prepare inputs (random placeholders; replace with real data)
batch_size = 4
gene_expression = torch.randn(batch_size, 5000)  # shape [batch_size, 5000]
drug_features = torch.randn(batch_size, 3122)    # shape [batch_size, 3122]
cell_features = torch.randn(batch_size, 113)     # shape [batch_size, 113]

# Make predictions
results = model.predict(gene_expression, drug_features, cell_features)

# Access outputs
reconstructed_expression = results["x_hat"]
latent_representation = results["mu"]
drug_response_predictions = results["y_pred"]
response_probabilities = results["probs"]
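
One way to explore the latent representation is to compare samples by cosine similarity. The sketch below is not a workflow described by the authors. It assumes the latent output has been converted to a NumPy array of shape [n_samples, 50] and uses a random stand-in in its place; the helper names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(latent):
    """Pairwise cosine similarity between the rows of an [n, d] array."""
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def nearest_neighbours(latent, k=2):
    """Indices of the k most similar other rows for each row, best first."""
    sim = cosine_similarity_matrix(latent)
    np.fill_diagonal(sim, -np.inf)  # exclude each sample's match with itself
    return np.argsort(-sim, axis=1)[:, :k]


rng = np.random.default_rng(0)
mu = rng.standard_normal((5, 50))  # stand-in for the latent representation
sim = cosine_similarity_matrix(mu)
neighbours = nearest_neighbours(mu, k=2)
print(neighbours)
```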

Dataset

This model was developed using the Tahoe-100M dataset as part of the Tahoe-DeepDive Hackathon 2025.

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
