SyedA5688 committed on
Commit 009b04d · verified · 1 Parent(s): cc9e786

Added files for 1B C2S-Scale-Pythia model

Files changed (1)
  1. README.md +59 -3
README.md CHANGED
@@ -1,3 +1,59 @@
- ---
- license: cc-by-nc-nd-4.0
- ---
+ ---
+ license: cc-by-nc-nd-4.0
+ language:
+ - en
+ base_model: EleutherAI/pythia-1b
+ library_name: transformers
+ tags:
+ - biology
+ - scRNAseq
+ ---
+
+
+ # Overview
+ This is the C2S-Scale-1B pretrained model, based on the Pythia-1b architecture
+ developed by EleutherAI and fine-tuned using the Cell2Sentence (C2S) framework on a wide array of single-cell RNA sequencing
+ (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. Cell2Sentence
+ adapts large language models (LLMs) to single-cell biology by converting scRNA-seq data into
+ "cell sentences": ordered sequences of gene names ranked by expression level. This model has been trained
+ to perform a broad range of single-cell and multi-cell tasks.
+
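The "cell sentence" idea from the Overview can be illustrated with a minimal sketch. This is not part of the committed README or the Cell2Sentence library: the function name `to_cell_sentence`, the toy counts, dropping zero-count genes, and the alphabetical tie-break are all assumptions made for illustration.

```python
def to_cell_sentence(expression, max_genes=None):
    """Turn a {gene_name: count} mapping into a space-separated cell sentence.

    Hypothetical helper, not the Cell2Sentence API.
    """
    # Assumption: genes with zero counts are left out of the sentence.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Rank by decreasing expression; ties are broken alphabetically so the
    # output is deterministic (the tie-break rule is an assumption).
    ranked = sorted(expressed, key=lambda pair: (-pair[1], pair[0]))
    genes = [gene for gene, _ in ranked]
    if max_genes is not None:
        genes = genes[:max_genes]  # keep only the top-ranked genes
    return " ".join(genes)


# Toy expression profile for one cell (made-up counts).
toy_cell = {"MALAT1": 120, "CD3E": 8, "ACTB": 45, "GAPDH": 45, "IL7R": 0}
sentence = to_cell_sentence(toy_cell, max_genes=3)
print(sentence)
```

The `max_genes` argument mirrors the fact that the model was trained on sentences of varying length; the real preprocessing should be taken from the Cell2Sentence repository linked in the README.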
+ # Training Data
+ This model was trained on over 57 million human and mouse cells gathered from over 800 single-cell RNA sequencing
+ datasets from CellxGene and the Human Cell Atlas. The data cover a broad range of cell types and conditions
+ from multiple tissues in both species.
+
+ This model was trained with a variable number of genes per cell sentence, with a maximum context length of 8192 tokens.
+ The context length of the default Pythia model was extended using rotary positional embeddings prior to C2S training.
+ - Cells: For multi-cell samples, each training sample contained between 5 and 20 cells, with the same number of genes for every cell in a sample.
+ - Genes: For single-cell samples, each cell sentence contained between 100 and 2048 genes. For multi-cell samples, each cell's sentence contained between 100 and 400 genes.
+
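The multi-cell constraints listed under Training Data (5 to 20 cells per sample, 100 to 400 genes per cell, same gene count for every cell in a sample) can be sketched as follows. This is not part of the committed README: the README does not say how the counts were drawn, so uniform random sampling, the helper name `build_multi_cell_sample`, and the synthetic gene names are all assumptions.

```python
import random

# Ranges quoted in the README for multi-cell training samples.
MIN_CELLS, MAX_CELLS = 5, 20
MIN_GENES, MAX_GENES = 100, 400


def build_multi_cell_sample(cell_sentences, rng):
    """Pick cells and truncate each to one shared gene count.

    cell_sentences: list of ranked gene-name lists, one per cell.
    Hypothetical helper; uniform sampling is an assumption.
    """
    if len(cell_sentences) < MIN_CELLS:
        raise ValueError("need at least 5 cells to build a multi-cell sample")
    n_cells = rng.randint(MIN_CELLS, min(MAX_CELLS, len(cell_sentences)))
    chosen = rng.sample(cell_sentences, n_cells)
    # One gene count is drawn per sample, capped by the shortest chosen cell,
    # so every cell in the sample ends up with the same number of genes.
    shortest = min(len(cell) for cell in chosen)
    if shortest < MIN_GENES:
        raise ValueError("every chosen cell needs at least 100 ranked genes")
    n_genes = rng.randint(MIN_GENES, min(MAX_GENES, shortest))
    return [cell[:n_genes] for cell in chosen]


# Synthetic pool: 30 cells with 500 placeholder gene names each.
pool = [[f"GENE{i}_{j}" for j in range(500)] for i in range(30)]
sample = build_multi_cell_sample(pool, random.Random(0))
print(len(sample), len(sample[0]))
```

Drawing the gene count once per sample, not once per cell, is what enforces the "same number of genes for every cell" condition.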
+ # Tasks
+ This model is designed for the following tasks:
+
+ Single-Cell Tasks
+ - Unconditional single-cell generation: Generate single cell sentences unconditionally.
+ - Cell type prediction: Predict the cell type of a given single cell.
+ - Cell type-conditioned generation: Generate a single cell sentence conditioned on a specific cell type.
+
+ Multi-Cell Tasks
+ - Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
+ - Tissue prediction: Predict the tissue of origin for a group of cells.
+ - Cell type prediction: Predict the cell type for each cell in a group of multiple cells.
+ - Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
+ - Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each individual cell.
+ - Multi-cells to abstract: Generate a research paper abstract based on the provided multi-cell sentences.
+ - Abstract to multi-cells: Generate multiple cell sentences based on a given research paper abstract.
+
+ Gene Set Tasks
+ - Gene set name to genes: Generate an alphabetical list of genes given a gene set name.
+ - Genes to gene set name: Generate the name of a gene set given an alphabetical list of genes.
+
+ # Cell2Sentence Links
+ - GitHub: https://github.com/vandijklab/cell2sentence
+ - Paper: https://www.biorxiv.org/content/10.1101/2023.09.11.557287v3
+
+ # Pythia Links
+ - Paper: https://arxiv.org/pdf/2304.01373
+ - Hugging Face: https://huggingface.co/EleutherAI/pythia-410m