Issues accessing the data

#1
by Hephen - opened

Hi, thanks for sharing the data here! I am having some issues accessing the model and the adata:

  1. when running the "get started" code "tahoe = tahoe_hubmodel.model" I got this error:
    KeyError: '_scvi_latent_qzm not found in adata.obsm.'
  2. I downloaded the "adata.h5ad" file and found that the "counts" matrix in the "layers" of the adata is all zeros!
    Could you please help address these issues? Thank you!
Tahoe Bio org

Hi @Hephen ,

Regarding 1.)

I just tried these commands:

> import scvi.hub
> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
    repo_name = 'vevotx/Tahoe-100M-SCVI-v1',
)

/drive_1/tmp  /ipykernel_1405746/2535911278.py:1: UserWarning: No revision was passed, so the default (latest) revision will be used.
  tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 12.88it/s]

> tahoe = tahoe_hubmodel.model  # This takes ~11 minutes
INFO     Loading model...                                                                                          
INFO     File                                                                                                      
         /home/valentine/.cache/huggingface/hub/models--vevotx--Tahoe-100M-SCVI-v1/snapshots/b5283a73fbbed812a95264
         ace360da538b20af89/model.pt already downloaded      

> tahoe
SCVI model with the following parameters: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb, 
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True

> tahoe.adata.obsm
AxisArrays with keys: X_latent_qzm, X_latent_qzv, scvi_latent_qzm, scvi_latent_qzv

Which version of scvi-tools are you using? It could be that an older version required the .obsm keys to start with '_'. Here, I tried with version 1.3.0.

You can check like this:

> import scvi
> scvi.__version__
'1.3.0'

Something you could do (though you shouldn't need to) is read in the anndata file manually and rename adata.obsm['scvi_latent_qzm'] to adata.obsm['_scvi_latent_qzm'], and similarly for ..._qzv.

I'm hoping it's just something that changed with the versions.
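If the manual workaround does turn out to be necessary, it could look roughly like the sketch below. The helper name add_underscore_prefix is made up for illustration and is not part of scvi-tools or anndata. It is shown on a plain dict standing in for adata.obsm, and it has not been tested against the actual model loading code, so treat it as an assumption rather than a confirmed fix.

```python
from collections.abc import MutableMapping


def add_underscore_prefix(
    obsm: MutableMapping,
    keys=("scvi_latent_qzm", "scvi_latent_qzv"),
) -> list:
    """Rename each listed key to the same name with a leading '_', in place.

    Hypothetical helper for illustration only. Keys that are missing, or
    whose prefixed name already exists, are left untouched.
    """
    renamed = []
    for key in keys:
        new_key = "_" + key
        if key in obsm and new_key not in obsm:
            obsm[new_key] = obsm.pop(key)
            renamed.append(new_key)
    return renamed


# Toy stand-in for adata.obsm (the real one holds arrays, one row per cell).
toy_obsm = {
    "X_latent_qzm": [[0.1]],
    "scvi_latent_qzm": [[0.1]],
    "scvi_latent_qzv": [[0.2]],
}
renamed = add_underscore_prefix(toy_obsm)
print(renamed)
```

With the real file, the same function would be called on the .obsm of an object loaded with anndata.read_h5ad("adata.h5ad"), since adata.obsm behaves like a mutable mapping.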

Regarding 2.)

Yes this is part of the 'minification' process :) https://docs.scvi-tools.org/en/stable/tutorials/notebooks/use_cases/minification.html

Instead of storing the 1.1 TB of count data in X, we just store the encoded 'scvi_latent_qzm' and 'scvi_latent_qzv' vectors, which can be decoded to gene expression levels and simulated counts.
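To give a rough idea of what the two stored vectors represent, here is a minimal sketch of drawing a latent vector for one cell. It assumes qzm holds the posterior means and qzv the posterior variances of a normal latent distribution. The values are toy numbers, and the decoder network that turns the latent vector into expression levels is not reproduced here.

```python
import math
import random


def sample_latent(qzm, qzv, rng):
    """Draw z ~ Normal(mean=qzm, std=sqrt(qzv)) elementwise for one cell."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(qzm, qzv)]


rng = random.Random(0)   # fixed seed so the sketch is reproducible
qzm = [0.5, -1.2, 0.0]   # toy posterior means (the real model has n_latent = 10)
qzv = [0.04, 0.01, 0.0]  # toy posterior variances
z = sample_latent(qzm, qzv, rng)
print(z)
```

In the minified model, a vector like z is what gets passed through the trained decoder, which is why only the small latent arrays need to be shipped.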

Best,
/Valentine

Hi @valsv Valentine, Thank you so much for your reply! Those make a lot of sense!
May I ask how to obtain the gene expression levels (i.e. the counts matrix) from the current adata.h5ad? Thanks!

Hi @Hephen ,

As described in the example on the model card, to get normalized gene expression, you can do this:

gene_expression = tahoe.get_normalized_expression(adata, indices = cell_indices, gene_list = gene_list)

where cell_indices is a subset of cells, and gene_list is a subset of genes.

If you want simulated counts from the minified anndata, you can do this:

umi_counts = tahoe.posterior_predictive_sample(adata, indices = cell_indices, gene_list = gene_list)

(I would recommend not generating the full matrices with all cells and genes, since they will be very large.)
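One way to stay within memory is to process the cells in chunks. The sketch below uses a hypothetical helper, iter_chunks, and the chunk size in the comment is an arbitrary placeholder, not a recommendation from the model card.

```python
def iter_chunks(indices, chunk_size):
    """Yield consecutive slices of `indices`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    indices = list(indices)
    for start in range(0, len(indices), chunk_size):
        yield indices[start:start + chunk_size]


# Toy demonstration with ten cell indices and chunks of four.
chunks = list(iter_chunks(range(10), 4))
print(chunks)

# With the model loaded, each chunk could be decoded separately, e.g.:
#   for chunk in iter_chunks(cell_indices, 50_000):  # arbitrary chunk size
#       expr = tahoe.get_normalized_expression(
#           adata, indices=chunk, gene_list=gene_list
#       )
#       ...  # summarize or write out `expr` before moving to the next chunk
```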

Thanks @valsv !!
So we can only get simulated counts from the adata.h5ad? I wonder how close they are to the original raw counts?
Also, is it possible to get gene counts from the model?
Thanks again!

Thank you for the clarifications, @valsv , and great work! One question related to the one OP asked -- would it be possible for you to share the "highly variable genes" (10,126) identified in the study? I am interested in understanding how much the difference between the scVI reconstruction and the true normalized expression values would impact downstream analysis. I'm assuming the normalized expression is obtained via some simple operations similar to sc.pp.normalize_total and sc.pp.log1p. As outlined in the README:

Calibration analysis shows that the model generates counts that contains the observed counts within the 95% confidence intervals from the posterior predictive distribution 97.7% of the time. However, a naive baseline of producing only 0-counts achieves 97.4% on the same metric.

Thanks in advance!
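For readers unfamiliar with the calibration metric quoted above, here is a minimal sketch of how such a coverage number can be computed, and why an all-zero baseline scores well on sparse counts. The exact procedure used for the README is not described in this thread, so the interval construction below is an assumption, and the data are toy values.

```python
import math


def interval_coverage(observed, samples, level=0.95):
    """Fraction of observed values inside the central `level` interval
    of their own posterior predictive draws (empirical quantiles)."""
    tail = (1.0 - level) / 2.0
    hits = 0
    for obs, draws in zip(observed, samples):
        ordered = sorted(draws)
        last = len(ordered) - 1
        lower = ordered[int(tail * last)]
        upper = ordered[math.ceil((1.0 - tail) * last)]
        hits += lower <= obs <= upper
    return hits / len(observed)


def zero_baseline_coverage(observed):
    """Coverage of a baseline that always predicts 0: the share of zeros."""
    return sum(1 for obs in observed if obs == 0) / len(observed)


# Toy data: ten (cell, gene) entries, mostly zeros, 20 draws per entry.
observed = [0, 0, 0, 0, 0, 0, 0, 0, 3, 1]
samples = (
    [[0] * 18 + [1, 2] for _ in range(8)]        # draws for the zero entries
    + [[0] * 10 + [1] * 5 + [2] * 3 + [4, 5]]    # draws for the entry with 3
    + [[0] * 20]                                 # draws for the entry with 1
)
coverage = interval_coverage(observed, samples)
baseline = zero_baseline_coverage(observed)
print(coverage, baseline)
```

Because single-cell count matrices are mostly zeros, the baseline's coverage is simply the fraction of zero entries, which is why the two numbers in the README sit so close together.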

@Hephen I think it was mentioned that tahoe.posterior_predictive_sample can be used to generate counts.

Tahoe Bio org

@Hephen Yes, the intended use case for this model is to be able to do analysis with limited resources. This can be achieved by using the representation vectors for the cells, which the SCVI model decompresses into approximate data.

The original counts are hosted by the Arc Institute at https://arcinstitute.org/tools/virtualcellatlas (direct link to download details: https://github.com/ArcInstitute/arc-virtual-cell-atlas/blob/main/tahoe-100/README.md). The instructions will enable you to download the full training data as a collection of .h5ad.gz files.

Additionally, the full data is also available in a different format as a Hugging Face dataset here: https://huggingface.co/datasets/vevotx/Tahoe-100M

Tahoe Bio org

Hi @jasperhyp ,

This model is trained and evaluated on all genes. Normalized expression is the output from the SCVI model. The posterior predictive distributions used for evaluation and criticism are distributions of counts compared to observed counts.

@valsv Thank you for clarifying the normalized expression! I am still interested in looking at the 10,126 highly variable genes as identified in the study to ensure consistency with your experiment setups. Though, no worries if this information isn't available yet!
