Sheng-Yong Niu committed on
Commit b33b5c3 · verified · 1 Parent(s): 1c3639c

Upload 7 files

Files changed (7)
  1. Dawo_model.ipynb +0 -0
  2. README.md +153 -0
  3. config.json +11 -0
  4. dawo.py +122 -0
  5. dawo_wrapper.py +60 -0
  6. example.py +99 -0
  7. requirements.txt +5 -0
Dawo_model.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,153 @@
---
language: code
license: mit
library_name: pytorch
tags:
- variational-autoencoder
- drug-response
- vae
- cancer-drug
- tahoe-deepdive
datasets:
- biomedical
- tahoebio/Tahoe-100M
---

# DAWO: Drug-Aware and Cell-line-Aware Variational Autoencoder

[![tahoe-deepdive](https://img.shields.io/badge/tag-tahoe--deepdive-blue)](https://huggingface.co/datasets/tahoebio/Tahoe-100M)

## Team Name
DAWO

## Members
- Yuhan Hao
- Sheng-Yong Niu
- Jaanak Prashar
- Tiange (Alex) Cui
- Danila Bredikhin
- Mikaela Koutrouli

## Project

### Title
DAWO: Drug-Aware and Cell-line-Aware Variational Autoencoder for Drug Response Prediction

### Overview
DAWO is a specialized Variational Autoencoder (VAE) designed to predict drug responses in cancer cell lines by integrating gene expression data with drug and cell line features. The model leverages multi-modal representation learning to capture complex interactions between drugs and cells, enabling more accurate prediction of drug responses across diverse conditions.

### Motivation
Understanding and predicting how cancer cells respond to different therapeutic compounds is crucial for advancing precision medicine in oncology. Traditional methods often fail to capture the complex relationships between drugs, cell lines, and their molecular profiles. DAWO addresses this challenge by combining a VAE architecture with drug-aware and cell-line-aware components that model these interactions directly.

### Methods
DAWO incorporates a multi-modal architecture with the following key components:

1. **Gene Expression Encoder**: Processes normalized gene expression data from cancer cell lines (input dimension: 5000)
2. **Drug Feature Encoder**: Processes drug features combining:
   - Drug summary embeddings
   - ChemBERTa molecular structure embeddings
   - Semantic feature embeddings
   (Total input dimension: 3122)
3. **Cell Line Feature Encoder**: Processes cell line features focusing on driver gene mutations and other genomic characteristics (input dimension: 113)
4. **Latent Space**: A 50-dimensional latent representation combining drug, cell line, and gene expression information
5. **Decoder**: Reconstructs gene expression profiles from the latent representation
6. **Classifier**: Predicts drug response categories from the latent representation (379 classes)

The model was trained using a combined loss function that balances reconstruction accuracy, latent space regularization, and classification performance.
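For concreteness, here is a minimal sketch of that combined objective. The reconstruction and KL terms follow `loss_function` in `dawo.py` (summed MSE plus β-weighted KL, with β = 0.1 from `config.json`); the cross-entropy classification term and its weight `lambda_cls` are illustrative assumptions, since the exact weighting used during training is not included in this upload.

```python
import torch
import torch.nn.functional as F

def combined_loss(x_hat, x, mu, logvar, y_logits, y_true, beta=0.1, lambda_cls=1.0):
    # Reconstruction term: summed MSE between input and reconstructed expression
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, sigma^2) and the standard normal prior
    kld = 0.5 * torch.sum(logvar.exp() - logvar - 1 + mu.pow(2))
    # Classification term (assumed cross-entropy; lambda_cls is a hypothetical weight)
    cls = F.cross_entropy(y_logits, y_true, reduction="sum")
    return recon + beta * kld + lambda_cls * cls
```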
### Results
DAWO demonstrates strong performance in predicting drug responses across multiple cancer cell lines, with particular strength in:

1. Distinguishing between responsive and non-responsive cell lines for specific drugs
2. Generalizing to drug-cell line combinations not seen during training
3. Capturing meaningful biological signals in the latent space that reflect known drug mechanisms and cellular pathways

### Discussion
Our model provides a framework for drug response prediction that could accelerate drug discovery and repurposing efforts. The integration of multi-modal data (gene expression, drug features, cell line characteristics) enables DAWO to capture complex interaction patterns that simpler models miss.

Limitations include the need for comprehensive feature sets for new drugs and cell lines, and potential biases from the training data distribution. Future work will focus on incorporating additional molecular modalities and expanding the training data to improve generalization across diverse drug classes.

## Model Description
Using a variational autoencoder (VAE) approach, DAWO learns latent representations of gene expression, drug, and cell line data and combines them to predict drug responses and identify potential drug-cell line interactions.

## Model Inputs and Outputs

### Inputs
- **Gene Expression Data**: Normalized gene expression profiles (shape: [batch_size, 5000])
- **Drug Features**: Combined drug embeddings, concatenating:
  - Drug summary embeddings
  - ChemBERTa molecular structure embeddings
  - Semantic feature embeddings
  (Total shape: [batch_size, 3122])
- **Cell Line Features**: Cell line genomic profiles (shape: [batch_size, 113])

### Outputs
- **Reconstructed Gene Expression**: Reconstructed expression profiles (shape: [batch_size, 5000])
- **Latent Representation**: Compressed representation in latent space (shape: [batch_size, 50])
- **Drug Response Predictions**: Unnormalized class scores (logits) over the response classes (shape: [batch_size, 379])
- **Response Probabilities**: Softmax probabilities for each response class (shape: [batch_size, 379])

## How to Use
```python
import torch

from dawo_wrapper import DAWOWrapper

# Initialize the model (repo_path must contain config.json and model.pth)
model = DAWOWrapper(repo_path="path/to/model")

# Prepare inputs; random tensors stand in for real data here
gene_expression = torch.randn(1, 5000)  # normalized expression profile
drug_features = torch.randn(1, 3122)    # combined drug embedding
cell_features = torch.randn(1, 113)     # cell line genomic profile

# Make predictions
results = model.predict(gene_expression, drug_features, cell_features)

# Access outputs
reconstructed_expression = results["x_hat"]
latent_representation = results["mu"]
drug_response_predictions = results["y_pred"]
response_probabilities = results["probs"]
```
## Dataset
This model was developed using the [Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M) dataset as part of the Tahoe-DeepDive Hackathon 2025.

## License
MIT License

Copyright (c) 2023 Team DAWO

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
config.json ADDED
@@ -0,0 +1,11 @@
{
  "input_dim_X": 5000,
  "input_dim_Y": 3122,
  "input_dim_Z": 113,
  "latent_dim": 50,
  "latent_dim_mid": 500,
  "Y_emb": 50,
  "Z_emb": 50,
  "num_classes": 379,
  "beta": 0.1
}
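These fields map directly onto the `DAWO` constructor arguments in `dawo.py`, except `beta`, which weights the KL term in the training loss. A minimal loading sketch, assuming `config.json` sits in the working directory (this mirrors what `dawo_wrapper.py` does):

```python
import json
from dawo import DAWO

with open("config.json") as f:
    cfg = json.load(f)

# "beta" configures the loss, not the model, so it is not passed here
model = DAWO(
    input_dim_X=cfg["input_dim_X"],   # gene expression dimension
    input_dim_Y=cfg["input_dim_Y"],   # combined drug feature dimension
    input_dim_Z=cfg["input_dim_Z"],   # cell line feature dimension
    latent_dim_mid=cfg["latent_dim_mid"],
    latent_dim=cfg["latent_dim"],
    Y_emb=cfg["Y_emb"],
    Z_emb=cfg["Z_emb"],
    num_classes=cfg["num_classes"],
)
```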
dawo.py ADDED
@@ -0,0 +1,122 @@
import numpy as np
import pandas as pd
import scipy.sparse as sp
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, TensorDataset
import torch.nn.functional as F


def Anndata_to_Tensor(adata, label=None, label_continuous=None, batch=None, device='cpu'):
    # Convert a sparse expression matrix to a dense tensor
    if isinstance(adata.X, (sp.csr_matrix, sp.csc_matrix)):
        X_tensor = torch.tensor(adata.X.toarray(), dtype=torch.float32).to(device)
    else:
        X_tensor = torch.tensor(adata.X, dtype=torch.float32).to(device)

    tensors = {'X_tensor': X_tensor}

    if label is not None:
        # Encode categorical labels as integer class indices
        labels_num, _ = pd.factorize(adata.obs[label], sort=True)
        tensors['labels_num'] = torch.tensor(labels_num, dtype=torch.long)

    if label_continuous is not None:
        tensors['label_continuous'] = torch.tensor(adata.obs[label_continuous], dtype=torch.float64)

    if batch is not None:
        # One-hot encode the batch covariate
        batch_one_hot = pd.get_dummies(adata.obs[batch]).to_numpy()
        tensors['batch_one_hot'] = torch.from_numpy(batch_one_hot)

    if len(tensors) == 1 and 'X_tensor' in tensors:
        return tensors['X_tensor']
    else:
        # Return a TensorDataset with the available tensors
        return TensorDataset(*tensors.values())


def loss_function(x_hat, x, mu, logvar, β=0.1):
    # Reconstruction loss (summed MSE, despite the BCE name)
    BCE = nn.functional.mse_loss(
        x_hat, x.view(-1, x_hat.shape[1]), reduction='sum'
    )
    # KL divergence between N(mu, sigma^2) and the standard normal prior
    KLD = 0.5 * torch.sum(logvar.exp() - logvar - 1 + mu.pow(2))

    return BCE + β * KLD


class DAWO(nn.Module):
    def __init__(self, input_dim_X, input_dim_Y, input_dim_Z, latent_dim_mid=500, latent_dim=50, Y_emb=50, Z_emb=50, num_classes=10):
        super(DAWO, self).__init__()

        # Gene expression encoder; outputs mu and logvar (hence latent_dim * 2)
        self.encoder = nn.Sequential(
            nn.BatchNorm1d(input_dim_X),
            nn.Linear(input_dim_X, latent_dim_mid),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(latent_dim_mid, latent_dim * 2),
        )

        # Drug feature encoder
        self.encoder_Y = nn.Sequential(
            nn.BatchNorm1d(input_dim_Y),
            nn.Linear(input_dim_Y, latent_dim_mid),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(latent_dim_mid, Y_emb),
        )

        # Cell line feature encoder
        self.encoder_Z = nn.Sequential(
            nn.BatchNorm1d(input_dim_Z),
            nn.Linear(input_dim_Z, latent_dim_mid),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(latent_dim_mid, Z_emb),
        )

        # Decoder reconstructs expression from the concatenated (x, y, z) embedding
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + Y_emb + Z_emb, latent_dim_mid),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(latent_dim_mid, input_dim_X),
        )

        # Classifier predicts drug response from the (x, z) embedding
        self.classifier = nn.Sequential(
            nn.BatchNorm1d(latent_dim + Z_emb),
            nn.Linear(latent_dim + Z_emb, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes)
        )

        self.input_dim = input_dim_X
        self.input_dim_Y = input_dim_Y
        self.input_dim_Z = input_dim_Z
        self.latent_dim = latent_dim

    def reparameterise(self, mu, logvar):
        # Reparameterisation trick: sample z = mu + sigma * eps during training
        if self.training:
            std = logvar.mul(0.5).exp_()
            eps = torch.randn_like(std)
            return eps.mul(std).add_(mu)
        else:
            # Use the posterior mean at inference time
            return mu

    def forward(self, x, y, z):
        # Split the expression encoder output into mu and logvar
        mu_logvar = self.encoder(x.view(-1, self.input_dim)).view(-1, 2, self.latent_dim)
        l_y = self.encoder_Y(y.view(-1, self.input_dim_Y))
        l_z = self.encoder_Z(z.view(-1, self.input_dim_Z))

        mu = mu_logvar[:, 0, :]
        logvar = mu_logvar[:, 1, :]
        l_x = self.reparameterise(mu, logvar)

        # Decoder sees all three embeddings; classifier sees expression and cell line only
        l_xyz = torch.cat((l_x, l_y, l_z), dim=1)
        l_xz = torch.cat((l_x, l_z), dim=1)

        x_hat = self.decoder(l_xyz)
        y_pred = self.classifier(l_xz)

        return x_hat, mu, logvar, y_pred
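As a quick sanity check, the following sketch instantiates `DAWO` with the dimensions from `config.json` and pushes random tensors through it; the expected output shapes follow directly from the architecture above:

```python
import torch
from dawo import DAWO

model = DAWO(input_dim_X=5000, input_dim_Y=3122, input_dim_Z=113, num_classes=379)
model.eval()  # reparameterise() returns mu outside training mode

x = torch.randn(2, 5000)  # gene expression
y = torch.randn(2, 3122)  # combined drug features
z = torch.randn(2, 113)   # cell line features

x_hat, mu, logvar, y_pred = model(x, y, z)
print(x_hat.shape, mu.shape, logvar.shape, y_pred.shape)
# expected: torch.Size([2, 5000]) torch.Size([2, 50]) torch.Size([2, 50]) torch.Size([2, 379])
```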
dawo_wrapper.py ADDED
@@ -0,0 +1,60 @@
import os
import json
import torch
import numpy as np
from dawo import DAWO, loss_function, Anndata_to_Tensor


class DAWOWrapper:
    """
    Minimal wrapper around the DAWO model for use with the Hugging Face Hub
    """
    def __init__(self, repo_path):
        """
        Initialize the DAWO model

        Args:
            repo_path: Path to the repository with the model files
        """
        # Load configuration
        config_path = os.path.join(repo_path, "config.json")
        with open(config_path, 'r') as f:
            config = json.load(f)

        # Create the model with the original DAWO class
        self.model = DAWO(
            input_dim_X=config["input_dim_X"],
            input_dim_Y=config["input_dim_Y"],
            input_dim_Z=config["input_dim_Z"],
            latent_dim_mid=config["latent_dim_mid"],
            latent_dim=config["latent_dim"],
            Y_emb=config["Y_emb"],
            Z_emb=config["Z_emb"],
            num_classes=config["num_classes"]
        )

        # Load weights (map to CPU so a GPU-trained checkpoint loads anywhere)
        self.model.load_state_dict(
            torch.load(os.path.join(repo_path, "model.pth"), map_location="cpu")
        )
        self.model.eval()

    def predict(self, x, y, z):
        """
        Make predictions with the DAWO model

        Args:
            x: Gene expression tensor (batch_size, input_dim_X)
            y: Drug feature tensor (batch_size, input_dim_Y)
            z: Cell line feature tensor (batch_size, input_dim_Z)

        Returns:
            Dict with model outputs
        """
        with torch.no_grad():
            x_hat, mu, logvar, y_pred = self.model(x, y, z)

        return {
            "x_hat": x_hat,    # Reconstructed gene expression
            "mu": mu,          # Latent mean
            "logvar": logvar,  # Latent log variance
            "y_pred": y_pred,  # Drug response logits
            "probs": torch.softmax(y_pred, dim=1)  # Drug response probabilities
        }
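A minimal usage sketch, assuming `config.json` and `model.pth` are present in the current directory; note that `predict` also exposes `logvar`, which the README example does not use:

```python
import torch
from dawo_wrapper import DAWOWrapper

wrapper = DAWOWrapper(repo_path="./")
out = wrapper.predict(torch.randn(2, 5000), torch.randn(2, 3122), torch.randn(2, 113))
print(out["probs"].shape)   # torch.Size([2, 379])
print(out["logvar"].shape)  # torch.Size([2, 50])
```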
example.py ADDED
@@ -0,0 +1,99 @@
import torch
import numpy as np
import pandas as pd
import json

from dawo_wrapper import DAWOWrapper

print("DAWO Model Example: Drug Response Prediction")
print("============================================")

# Initialize the model
print("\n1. Loading the DAWO model...")
model = DAWOWrapper(repo_path="./")

# Load data files from the data folder
print("\n2. Loading drug and cell line features...")

# Set data directory (use local data directory)
data_dir = "./data"

# Drug feature components
print("   - Loading drug semantic features...")
drug_semantic = pd.read_csv(f'{data_dir}/semantic_features_combined.csv', index_col='drug')
print(f"     Shape: {drug_semantic.shape}")

print("   - Loading drug structure embeddings...")
drug_structure = pd.read_csv(f'{data_dir}/chemberta_cls_embeddings.csv', index_col='drug')
print(f"     Shape: {drug_structure.shape}")

print("   - Loading drug summary embeddings...")
with open(f'{data_dir}/drug_summaries.json', 'r') as f:
    drug_name = json.load(f)
drug_emb = np.load(f'{data_dir}/drug_summary_lowd.npy')
print(f"     Shape: {drug_emb.shape}")

# Cell line features
print("   - Loading cell line driver gene mutation profiles...")
cell_features = pd.read_parquet(f'{data_dir}/drivergene_cellline_matrix.parquet')
cell_features.index = cell_features['cell_name']
cell_features.drop(columns=['cell_name'], inplace=True)
print(f"     Shape: {cell_features.shape}")

# Select a sample drug and cell line
print("\n3. Preparing inputs for prediction:")

# Select a drug for demonstration
sample_drug = list(drug_name.keys())[0]
print(f"   - Selected drug: {sample_drug}")

# Create the complete drug feature vector by concatenating the three embedding types
print("   - Constructing drug feature vector...")
drug_idx = list(drug_name.keys()).index(sample_drug)
drug_feature = np.concatenate((
    drug_emb[drug_idx],                      # Drug summary embedding
    drug_structure.loc[sample_drug].values,  # Molecular structure embedding
    drug_semantic.loc[sample_drug].values    # Semantic feature embedding
))
drug_features = torch.tensor(drug_feature, dtype=torch.float32).unsqueeze(0)  # Add batch dimension
print(f"     Combined drug feature shape: {drug_features.shape}")

# Select a cell line for demonstration
sample_cell = cell_features.index[0]
print(f"   - Selected cell line: {sample_cell}")

# Create the cell line feature vector
print("   - Constructing cell line feature vector...")
cell_feature = cell_features.loc[sample_cell].values
cell_features_tensor = torch.tensor(cell_feature, dtype=torch.float32).unsqueeze(0)  # Add batch dimension
print(f"     Original cell feature shape: {cell_features_tensor.shape}")

# Pad cell features to match the expected dimension (113)
print("   - Padding cell features to match model dimensions...")
padded_features = torch.zeros((1, 113), dtype=torch.float32)
padded_features[0, :cell_features_tensor.shape[1]] = cell_features_tensor
cell_features_tensor = padded_features
print(f"     Padded cell feature shape: {cell_features_tensor.shape}")

# Create simulated gene expression data (normally this would be real data)
print("   - Creating sample gene expression data...")
gene_expression = torch.randn(1, 5000)  # 1 sample, 5000 genes
print(f"     Gene expression shape: {gene_expression.shape}")

# Run prediction with the prepared data
print("\n4. Running prediction with the DAWO model...")
results = model.predict(gene_expression, drug_features, cell_features_tensor)

# Print results
print("\n5. Results:")
print(f"   - Reconstructed gene expression shape: {results['x_hat'].shape}")
print(f"   - Latent representation shape: {results['mu'].shape}")
print(f"   - Drug response prediction shape: {results['y_pred'].shape}")
print(f"   - Response probabilities shape: {results['probs'].shape}")

# Show the top predicted classes
print("\n   - Top predicted drug response classes:")
probs = results['probs'].squeeze().numpy()
top3_indices = np.argsort(probs)[-3:][::-1]
for i, idx in enumerate(top3_indices):
    print(f"     Class {idx}: {probs[idx]:.4f} probability")
requirements.txt ADDED
@@ -0,0 +1,5 @@
torch>=1.10.0
numpy>=1.20.0
pandas>=1.3.0
scipy>=1.7.0
pyarrow>=7.0.0