Issues accessing the data
Hi, thanks for sharing the data here! I am having some issues accessing the model and the adata:
- when running the "get started" code tahoe = tahoe_hubmodel.model I got this error: KeyError: '_scvi_latent_qzm not found in adata.obsm.'
- I downloaded the "adata.h5ad" file and found that the "counts" matrix in the "layers" of adata is all 0s!
Could you please help address these issues? Thank you!
Hi @Hephen ,
Regarding 1.)
I just tried these commands:
> import scvi.hub
> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
repo_name = 'vevotx/Tahoe-100M-SCVI-v1',
)
/drive_1/tmp/ipykernel_1405746/2535911278.py:1: UserWarning: No revision was passed, so the default (latest) revision will be used.
  tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 12.88it/s]
> tahoe = tahoe_hubmodel.model # This takes ~11 minutes
INFO Loading model...
INFO File
/home/valentine/.cache/huggingface/hub/models--vevotx--Tahoe-100M-SCVI-v1/snapshots/b5283a73fbbed812a95264
ace360da538b20af89/model.pt already downloaded
> tahoe
SCVI model with the following parameters:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb,
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True
> tahoe.adata.obsm
AxisArrays with keys: X_latent_qzm, X_latent_qzv, scvi_latent_qzm, scvi_latent_qzv
Which version of scvi-tools are you using? It could be that an older version required the .obsm keys to start with '_'. Here, I tried with version 1.3.0.
You can check like this:
> import scvi
> scvi.__version__
'1.3.0'
Something you could do (but you shouldn't need to...) is read in the anndata file manually and rename adata.obsm['scvi_latent_qzm'] to adata.obsm['_scvi_latent_qzm'], and similarly for ..._qzv.
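If it does turn out to be a key-naming mismatch, a minimal sketch of that workaround (assuming the downloaded file is named adata.h5ad in your working directory) could look like this:

import anndata as ad

# Read the minified AnnData downloaded from the Hub
adata = ad.read_h5ad("adata.h5ad")

# Copy the latent posterior parameters to the underscore-prefixed keys
# that an older scvi-tools release may be looking for
adata.obsm["_scvi_latent_qzm"] = adata.obsm["scvi_latent_qzm"]
adata.obsm["_scvi_latent_qzv"] = adata.obsm["scvi_latent_qzv"]

adata.write_h5ad("adata_renamed.h5ad")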
I'm hoping it's just something that changed with the versions.
Regarding 2.)
Yes, this is part of the 'minification' process :) https://docs.scvi-tools.org/en/stable/tutorials/notebooks/use_cases/minification.html
Instead of storing the 1.1 TB of count data in X, we just store the encoded 'scvi_latent_qzm' and 'scvi_latent_qzv' vectors, which can be decoded to gene expression levels and simulated counts.
Best,
/Valentine
Hi @Hephen ,
As described in the example on the model card, to get normalized gene expression, you can do this:
gene_expression = tahoe.get_normalized_expression(adata, indices = cell_indices, gene_list = gene_list)
where cell_indices is a subset of cells, and gene_list is a subset of genes.
If you want simulated counts from the minified anndata, you can do this:
umi_counts = tahoe.posterior_predictive_sample(adata, indices = cell_indices, gene_list = gene_list)
(I would recommend not generating the full matrices with all cells and genes, as they will be very large.)
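For example, a small end-to-end sketch of both calls, where the cell subset and gene subset are just illustrative placeholders:

import numpy as np

adata = tahoe.adata  # the minified AnnData attached to the Hub model

# Illustrative subsets: 500 random cells and the first 5 genes
cell_indices = np.random.choice(adata.n_obs, size=500, replace=False)
gene_list = list(adata.var_names[:5])

# Decoded (denoised) expression for the subset, returned as a DataFrame
gene_expression = tahoe.get_normalized_expression(adata, indices=cell_indices, gene_list=gene_list)

# Simulated UMI counts drawn from the posterior predictive distribution
umi_counts = tahoe.posterior_predictive_sample(adata, indices=cell_indices, gene_list=gene_list)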
Thank you for the clarifications, @valsv, and great work! One question related to the one OP asked: would it be possible for you to share the "highly variable genes" (10,126) identified in the study? I was interested in understanding how much the difference between the scVI reconstruction and the true normalized expression values would impact downstream analysis. I'm assuming the normalized expression is obtained via some simple operations similar to sc.pp.normalize_total and sc.pp.log1p. As outlined in the README:
Calibration analysis shows that the model generates counts that contains the observed counts within the 95% confidence intervals from the posterior predictive distribution 97.7% of the time. However, a naive baseline of producing only 0-counts achieves 97.4% on the same metric.
Thanks in advance!
@Hephen Yes, the intended use case for this model is to enable analysis with limited resources. This can be achieved by making use of the representation vectors for the cells, which are decoded to approximate data using the SCVI model.
The original counts are hosted at https://arcinstitute.org/tools/virtualcellatlas (direct link to download details: https://github.com/ArcInstitute/arc-virtual-cell-atlas/blob/main/tahoe-100/README.md). The instructions will let you download the full training data as a collection of .h5ad.gz files.
Additionally, the full data is also available in a different format as a Hugging Face dataset here: https://huggingface.co/datasets/vevotx/Tahoe-100M
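For example, a rough sketch of streaming that dataset with the datasets library (assuming a standard 'train' split; the available fields depend on the dataset schema):

from datasets import load_dataset

# Stream the records so the full dataset is never downloaded at once
ds = load_dataset("vevotx/Tahoe-100M", split="train", streaming=True)

# Peek at the first record to see which fields are available
first = next(iter(ds))
print(first.keys())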
Hi @jasperhyp ,
This model is trained and evaluated on all genes. Normalized expression is the output of the SCVI model. The posterior predictive distributions used for evaluation and criticism are distributions of counts that are compared against the observed counts.
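To make that criticism metric concrete, here is a rough sketch of how one might estimate the 95% coverage; it assumes a hypothetical non-minified chunk of the original data (tahoe_chunk.h5ad) whose cells line up with the same indices in the minified adata, which in practice would require aligning by barcode:

import anndata as ad
import numpy as np

real = ad.read_h5ad("tahoe_chunk.h5ad")  # hypothetical chunk with real UMI counts
cell_indices = np.arange(200)            # illustrative subset of cells
gene_list = list(real.var_names[:50])    # illustrative subset of genes

# Repeated posterior predictive draws: array of shape (cells, genes, n_samples)
samples = tahoe.posterior_predictive_sample(tahoe.adata, indices=cell_indices, gene_list=gene_list, n_samples=100)
samples = np.asarray(samples.todense()) if hasattr(samples, "todense") else np.asarray(samples)

# Empirical 95% interval per cell/gene entry
lower = np.quantile(samples, 0.025, axis=-1)
upper = np.quantile(samples, 0.975, axis=-1)

# Fraction of observed counts that fall inside the interval
X = real[cell_indices, gene_list].X
observed = np.asarray(X.todense()) if hasattr(X, "todense") else np.asarray(X)
coverage = ((observed >= lower) & (observed <= upper)).mean()
print(f"95% posterior predictive coverage: {coverage:.3f}")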