SentenceTransformer based on NeuML/pubmedbert-base-embeddings
This is a sentence-transformers model finetuned from NeuML/pubmedbert-base-embeddings on the cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: NeuML/pubmedbert-base-embeddings
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation
- Language: code
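The cosine similarity listed above compares two embeddings by the angle between them and ignores their length. A minimal pure-Python sketch follows; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the model's API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because only the direction matters, scaling an embedding by a positive constant leaves its similarity to every other embedding unchanged.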
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (pooling): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
)
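The Pooling configuration above enables only `pooling_mode_mean_tokens`, so the sentence embedding is the average of the token vectors produced by the text encoder. The sketch below illustrates that averaging in plain Python. It assumes, as in the standard Sentence Transformers Pooling module, that padding positions are excluded through the attention mask; `mean_pool` and the toy values are hypothetical and are not taken from the MMContextEncoder code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is skipped."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in totals]

# Toy input: three real tokens and one padding token,
# with 2 dimensions instead of the model's 768.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

The padding vector has no effect on the result because its mask entry is 0.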
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-pubmedbert-semantic_100k")
# Run inference
sentences = [
'Transcriptomic features: MALAT1, TMSB4X, EEF1A1, CD74, FTL, TPT1, FTH1, PTMA, NACA, TMSB10, ACTB, FOS, ATP5F1E, H3-3B, RNASET2, CYBA, JUNB, S100A6, UBA52, KLF6, ID2, TSC22D3, COX6C, PPIA, S100A10, SELENOH, VIM, PHPT1, BUD23, MYL12A, OAZ1, SPCS1, NDUFS7, DUSP1, SRP14, EMP3, PARP1, AFF3, FAU, UBC, TAGLN2, ARPC2, NAP1L1, TOMM7, CALM1, AP2S1, PKM, RHOA, HSP90AA1, COBLL1, HNRNPC, MYL6, ST13, RBX1, CTSZ, CST3, PSMD7, C19orf53, CHCHD2, SEC61B, WSB1, MS4A6A, ARPC3, GAPDH, C12orf75, C12orf57, C6orf62, C1orf56, C7orf50, C4orf3, C11orf58, C1orf21, C9orf78, C1orf43, C10orf90, C8orf34, .',
'Gene expression matches that of plasmacytoid dendritic cell cells, including: MALAT1, TMSB4X, CD74, EEF1A1, FTH1, TPT1, TMSB10, PTMA, UBA52, ACTB, FTL, FAU, ATP5F1E, CYBA, NPC2, SRP14, SERF2, CALM2, NACA, RACK1, CST3, DDX5, SRGN, PPIA, TCF4, COX4I1, KLF6, ERP29, SAMHD1, OAZ1, PFN1, VAMP8, COX7C, H3-3B, CD164, IFI44L, IRF8, CCDC50, ZFP36L2, ATP5MG, EIF1, TYROBP, VIM, RNASET2, YBX1, PABPC1, HNRNPC, MYL6, MYL12A, EDF1, PTGDS, MYL12B, DUSP1, CXCR4, LRRFIP1, PTPRE, SP110, ITM2C, LCP1, TXN, APP, PNRC1, CCDC186, C12orf75, C11orf58, C19orf53, C12orf57, C4orf3, C1orf122, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C8orf34, .',
'The gene markers EEF1A1, ACTB, MALAT1, TMSB4X, H3-3B, TMSB10, ZFP36, DNAJB1, NFKBIA, PFN1, HSP90AA1, TPT1, PTMA, FAU, DUSP2, EIF1, BTG1, IFITM2, HSPA8, GAPDH, FTH1, NACA, RACK1, TYROBP, FTL, HSPE1, SRGN, SERF2, JUNB, BTG2, FOS, CFL1, PPIA, CYBA, PABPC1, PPP1R15A, MYL6, HSP90AB1, GADD45B, MYL12A, ATP5F1E, SH3BGRL3, IER2, JUN, CORO1A, BTF3, PNRC1, UBC, NR4A2, UBB, HOPX, CMC1, PCBP2, CALM1, RHOA, DNAJA1, OAZ1, SUB1, PPDPF, COX7C, COX4I1, ATP5MC2, IFITM3, ARPC2, C12orf57, C4orf3, C12orf75, C19orf53, C11orf58, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C8orf34, C11orf80 suggest this cell is a natural killer cell.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6802, 0.4491],
# [0.6802, 1.0000, 0.4856],
# [0.4491, 0.4856, 1.0000]])
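To read the similarity matrix as a retrieval result, one can look up the most similar other sentence for each row. The sketch below does this in plain Python with the scores copied from the printed output above; `nearest_neighbours` is a hypothetical helper, not a Sentence Transformers function.

```python
# Scores copied from the printed similarity matrix; rows and columns
# follow the order of the `sentences` list.
similarity_matrix = [
    [1.0000, 0.6802, 0.4491],
    [0.6802, 1.0000, 0.4856],
    [0.4491, 0.4856, 1.0000],
]

def nearest_neighbours(matrix):
    """For each row, return the index of the most similar other row."""
    result = []
    for i, row in enumerate(matrix):
        # Skip the diagonal: every sentence is trivially closest to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        best_score, best_index = max(candidates)
        result.append(best_index)
    return result

neighbours = nearest_neighbours(similarity_matrix)
print(neighbours)
```

With these scores, the first two sentences are each other's nearest neighbour, and the natural killer cell sentence is closest to the plasmacytoid dendritic cell sentence.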
Evaluation
Metrics
Triplet
- Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_3_cell_sentence_4
- Evaluated with TripletEvaluator
Metric | Value |
---|---|
cosine_accuracy | 0.8359 |
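The cosine_accuracy value is the share of triplets in which the anchor is more similar to its positive than to its negative, so 0.8359 corresponds to roughly 84 of every 100 triplets being ranked correctly. The sketch below shows that computation in plain Python, assuming a zero margin and a strict comparison; the function names and the 2-dimensional toy embeddings are hypothetical, and the real evaluator embeds the text columns first.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)

# Hypothetical embeddings, 2 dimensions instead of 768.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer: correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: wrong
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),  # positive is closer: correct
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```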
Training Details
Training Dataset
cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation
- Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation at b141493
- Size: 81,143 training samples
- Columns: anchor, positive, negative_1, and negative_2
- Approximate statistics based on the first 1000 samples:
 | anchor | positive | negative_1 | negative_2 |
---|---|---|---|---|
type | string | string | string | string |
details | min: 619 characters, mean: 677.78 characters, max: 770 characters | min: 616 characters, mean: 677.49 characters, max: 770 characters | min: 614 characters, mean: 677.93 characters, max: 760 characters | min: 613 characters, mean: 678.84 characters, max: 758 characters |
- Samples (each sample spans four consecutive lines, in the order anchor, positive, negative_1, negative_2):
This cell shows significant expression of: TMSB4X, TMSB10, ACTB, MALAT1, GNLY, NKG7, IFITM2, LGALS1, GZMA, EEF1A1, PFN1, HMGB2, FTH1, PTMA, HSP90AA1, GZMB, ARHGDIB, HNRNPA2B1, PLAAT4, FAU, CMC1, VIM, MYL12A, CBX3, ATP5F1E, HCST, IFI44L, KLRF1, H3-3A, COX6C, ARL6IP1, CFL1, ISG15, HMGB1, S100A4, ATP5MF, RORA, MYL6, CORO1A, OAZ1, KLRB1, ID2, HMGN3, CCNI, RBM39, CAP1, SERF2, ELOC, FCER1G, S100A9, IFI16, YWHAZ, EIF1, CALR, HMGN2, SKAP2, SLC25A5, ZZZ3, YBX1, NUCB2, CDC42, GSTP1, FTL, ATP5F1D, C19orf53, C11orf58, C12orf57, C9orf78, C1orf162, C1orf122, C6orf62, C1orf21, C1orf54, C1orf198, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34, .
lymphocyte cells are known to express: MT-CO2, TMSB4X, MALAT1, TMSB10, EEF1A1, ACTB, PTMA, PFN1, GAPDH, HMGB2, HMGB1, TMA7, GNLY, TUBA1B, TPT1, FAU, YBX1, ATP5F1E, CD52, GSTP1, GZMB, CORO1A, CALM1, HMGN2, RACK1, MYL6, BLOC1S1, S100A6, VIM, COTL1, OAZ1, HNRNPA2B1, DEK, ETS1, SERF2, SRP14, NDUFS6, GZMA, H2AZ1, EEF2, HINT1, UQCRH, SRSF10, UBA52, CD74, ENO1, HSP90AA1, HSP90AB1, ARHGDIB, COX7C, ANXA1, TXN, SNRPG, MSN, UBB, COX8A, POLR2L, UBL5, PKM, FTL, LGALS1, RBM3, EIF3E, CHCHD2, C12orf57, C19orf53, C11orf58, C7orf50, C6orf62, C9orf78, C4orf3, C12orf75, C1orf21, C1orf54, C1orf198, C1orf162, C1orf56, C1orf43, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34, .
MALAT1, EEF1A1, TMSB4X, FTL, ACTB, DNAJB1, H3-3B, CD74, HSP90AA1, DUSP1, IL32, TMSB10, HSP90AB1, CD69, TPT1, BTG1, UBB, RGS1, PFN1, UBC, HSPB1, FAU, EIF1, GAPDH, SAT1, FTH1, HSPA8, HSPE1, SARAF, SERF2, TSC22D3, FOS, PTMA, NACA, CD3E, VIM, DNAJA1, ARHGDIB, CD2, CXCR4, ATP5F1E, SH3BGRL3, HSPA6, RACK1, UBA52, HERPUD1, KLF6, ITM2A, FXYD5, MYL6, CD37, OAZ1, NKG7, C12orf57, CYTIP, SRSF7, CACYBP, RGS2, TNFAIP3, SERP1, PPDPF, RAC2, COX4I1, SRRM1, C1orf43, C19orf53, C11orf58, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34, ATR are commonly found in CD8-positive, alpha-beta T cell cells.
These expression features — MALAT1, TMSB4X, EEF1A1, CD74, BTG1, PTMA, TMSB10, TPT1, FAU, EIF1, FTH1, FTL, CXCR4, TSC22D3, DUSP1, UBA52, ACTB, CD37, CD52, NACA, RACK1, EZR, CD69, LAPTM5, H3-3A, FOS, ISG20, YBX1, CIRBP, EIF3E, OAZ1, COX7C, SAT1, COX4I1, H3-3B, SH3BGRL3, UBC, UBB, JUNB, COMMD6, VIM, CYBA, KLF6, STK17B, FUS, HNRNPC, MYL6, GADD45B, LGALS1, EIF3L, SRSF5, NFKBIA, ANKRD12, CORO1A, TLE5, NOP53, CHCHD2, PFN1, DDX5, ARPC3, COX7A2, YPEL5, ARL4A, SRGN, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34 — are typical of a B cell identity.
A typical effector memory CD4-positive, alpha-beta T cell cell expresses the genes: EEF1A1, MALAT1, FTH1, JUNB, TPT1, FOS, TMSB10, BTG1, TMSB4X, ZFP36L2, NACA, PABPC1, ACTB, FAU, VIM, H3-3B, EIF1, ZFP36, SARAF, PTMA, IL7R, JUN, RACK1, EEF2, UBA52, GAPDH, FTL, FXYD5, DUSP1, S100A4, CD69, CXCR4, UBC, TSC22D3, CFL1, KLF6, ARHGDIB, KLF2, BTG2, CITED2, IER2, TUBB4B, CD3E, EEF1G, SLC2A3, NFKBIA, PFN1, SRGN, SNX9, COX4I1, DNAJB1, SERF2, CD8A, PCBP2, IL32, BIRC3, SMAP2, FUS, GADD45B, MYL12A, OAZ1, ATP5F1E, TUBA4A, C19orf53, C12orf57, C4orf3, C9orf78, C1orf162, C12orf75, C11orf58, C6orf62, C1orf21, C1orf54, C1orf198, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34, .
MALAT1, EEF1A1, TPT1, TMSB4X, ACTB, TMSB10, FAU, JUNB, RACK1, FTH1, PTMA, IL32, VIM, ZFP36L2, IL7R, S100A4, NACA, FTL, PFN1, CD52, EIF1, UBA52, EEF1G, PABPC1, SARAF, GAPDH, SH3BGRL3, EEF2, H3-3B, BTG1, TXNIP, FXYD5, MYL12A, SERF2, CFL1, CALM1, ARHGDIB, LDHB, ATP5F1E, CD3E, SLC2A3, NFKBIA, CORO1A, DDX5, HSPA8, C12orf57, COX7C, COX4I1, ITM2B, UBC, HINT1, TOMM7, PCBP2, S100A6, HSP90AA1, MYL6, HSP90AB1, NOP53, CD69, CXCR4, HNRNPA2B1, PPDPF, RAC2, PNRC1, C19orf53, C11orf58, C4orf3, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34 expression pattern defines this as a effector memory CD4-positive, alpha-beta T cell cell.
Detected gene expression: TMSB4X, ACTB, TMSB10, FTH1, FTL, EEF1A1, TPT1, UBA52, CD74, PPA1, GAPDH, TYROBP, LGALS1, PTMA, PFN1, IFI30, NACA, CD52, EIF1, EEF1G, CFL1, GSTP1, LYZ, MYL6, COX7C, TXN, SERF2, DBI, ARPC2, EEF2, CD44, RGS1, UQCR11, H3-3A, S100A11, RACK1, CYBA, YBX1, NDUFB2, CHCHD2, TPI1, NPC2, TUBA1B, COX4I1, GSN, UCP2, OST4, MARCKS, TYMP, PABPC1, ENO1, FSCN1, HSP90AA1, FKBP1A, TMEM230, RANBP1, COTL1, EIF3E, NOP53, HSPA8, COX7A2, SUB1, GBP1, TRIR, C4orf3, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34, .
CD74, MALAT1, EEF1A1, FOS, TPT1, TMSB4X, TMSB10, ACTB, FAU, JUN, CD37, DUSP1, RACK1, JUNB, EIF1, PTMA, FTL, DNAJB1, H3-3B, CD52, NACA, BTG1, TSC22D3, FTH1, PABPC1, EEF2, UBA52, EEF1G, HSP90AA1, LAPTM5, CYBA, PPP1R15A, HSP90AB1, CD69, ARHGDIB, ZFP36, SERF2, UBC, H3-3A, PCBP2, HLA-DRB5, KLF6, PFN1, DDX5, HSPA8, ARPC3, CD83, CCNI, CXCR4, ATP5F1E, SARAF, TUBA1A, ZFP36L1, TOMM7, HERPUD1, YBX1, RHOA, MEF2C, FXYD5, MYL6, SRSF5, MYL12A, CORO1A, OAZ1, C12orf57, C19orf53, C11orf58, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34 are the top expressed genes in this cell.
The expression of MALAT1, GRIK1, SYT1, PCDH9, RORA, NRG1, CADPS, ZFPM2, LRRC4C, LINGO2, RALYL, PTPRD, SPHKAP, CNTNAP5, SLC8A1, CCSER1, HDAC9, CELF2, R3HDM1, CNTN4, RBMS3, PCDH7, GALNT13, UNC5D, ROBO1, SYNPR, SNAP25, GPM6A, ANK3, FRMPD4, CHRM2, RYR2, KHDRBS2, CADM1, CACNA1D, RGS6, PDE4D, DOCK4, UNC13C, CDH18, FAT3, MEG3, NR2F2-AS1, HMCN1, GULP1, CAMK2D, ZEB1, SYN2, DYNC1I1, OXR1, DPP10, OSBPL6, FRAS1, PPP3CA, ZNF385D, ZMAT4, PCBP3, HS6ST3, ERC2, PLEKHA5, CDK14, MAP2, NCOA1, ATP8A2, C1orf21, C19orf53, C11orf58, C12orf57, C6orf62, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34 aligns with a neuron identity.
This cell expresses the genes: MALAT1, CDH18, RALYL, ZNF385D, CADPS2, SYT1, RIMS1, TRPM3, JMJD1C, RORA, PDE1A, CA10, FSTL5, CHN2, UNC13C, PPP3CA, GALNT13, TIAM1, ZBTB20, RELN, SNAP25, CNTN4, ANK3, RABGAP1L, RIT2, PTPRD, NRG1, NFIA, KCNJ3, ZFPM2, MCTP1, CADM1, CALN1, ZNF521, NEBL, RUNX1T1, ERC1, ABLIM1, SYNE1, NOVA1, CACNA1A, ZNF385B, GABRB2, CNKSR2, GPM6A, MAGI1, SPOCK1, CAMK4, GRM1, SYNPR, OXR1, UBE2E2, LHFPL6, KCTD8, CCSER1, KCNH7, PCLO, TCF4, RYR2, PRANCR, ETV1, TENM1, CELF2, ARID1B, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34, .
neuron cells typically express genes such as: MALAT1, NPY, GRIK1, SST, ROBO1, PCDH9, IL1RAPL2, SYT1, CCSER1, MEG3, CADPS, PTPRD, GPC6, LARGE1, PDE4D, GRIP1, KIAA1217, MEG8, EPHA6, NXPH1, PDE4B, PCDH7, DCC, GRIA3, UNC5D, NRG1, SOX6, ESRRG, PDE1A, CACNA2D3, OXR1, RAPGEF4, LINGO2, ANK3, RBMS3, RIMS1, LRRC4C, XKR4, PIP5K1B, PAM, TENM2, TCF4, RASGRF2, CHRM3, GULP1, CDH4, ZNF385D, DGKI, THSD7A, MAGI1, DAB1, PTPRM, QKI, FRMD4A, LRFN5, NELL1, MAML2, LHFPL3, CDH8, UTRN, SNAP25, ATP1B1, CAMK4, DPP10, C8orf34, C11orf58, C1orf56, C1orf43, C1orf21, C6orf62, C1orf122, C19orf53, C1orf198, C9orf78, C1orf162, C7orf50, C4orf19, C21orf91, C4orf48, C12orf57, C1orf54, .
MALAT1, PCDH9, PTPRD, NRG1, SYT1, DPP10, ROBO1, TENM2, LRRC4C, RBMS3, CNTNAP5, LINGO2, CDH18, SLC8A1, DMD, PDE4D, RYR2, ATP1B1, RGS6, PTPRT, CHRM3, ADGRL2, NOVA1, NTNG1, PCDH7, TAFA2, CCSER1, ANK3, MEG3, MAP2, PLCB4, CACNA2D1, PRKG1, LINC03000, RMST, RORA, FOXP2, LHFPL3, MEG8, TNRC6A, DAB1, KCTD8, RALYL, GNAS, INPP4B, OLFM3, CNTN4, FRMD4A, LINC00632, GAPDH, ENOX1, AHI1, GPM6A, EBF1, LRFN5, PCSK1N, SEMA5A, KIAA1217, CALY, MAP1B, SNAP25, GABRB2, CDH8, GRIP1, C8orf34, C4orf48, C19orf53, C11orf58, C1orf56, C9orf72, C1orf122, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf43, C7orf50, C4orf19, C10orf90, C21orf91, C4orf3 define the expression landscape of this cell.
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
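MultipleNegativesRankingLoss treats training as a ranking problem: each anchor must pick its own positive from all candidates in the batch, where the positives of the other anchors and the negative columns act as negatives. The sketch below is a plain-Python illustration of that objective with cosine similarity and `scale` 20.0. It is not the library implementation; the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multiple_negatives_ranking_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy of selecting each anchor's own positive among all candidates."""
    # Candidate i (for i < len(positives)) is the correct choice for anchor i.
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]  # equals -log softmax(logits)[i]
    return total / len(anchors)

# Hypothetical 2-dimensional embeddings for a batch of two anchors.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.7, 0.7]]

aligned = multiple_negatives_ranking_loss(anchors, positives, hard_negatives)
swapped = multiple_negatives_ranking_loss(anchors, positives[::-1], hard_negatives)
print(aligned, swapped)
```

The loss is small when each anchor sits closest to its own positive and large when the pairing is swapped. The scale factor sharpens the softmax, so small differences in cosine similarity produce larger differences in loss.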
Evaluation Dataset
cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation
- Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation at b141493
- Size: 9,011 evaluation samples
- Columns: anchor, positive, negative_1, and negative_2
- Approximate statistics based on the first 1000 samples:
 | anchor | positive | negative_1 | negative_2 |
---|---|---|---|---|
type | string | string | string | string |
details | min: 563 characters, mean: 626.08 characters, max: 722 characters | min: 563 characters, mean: 626.27 characters, max: 722 characters | min: 558 characters, mean: 626.66 characters, max: 714 characters | min: 561 characters, mean: 628.47 characters, max: 713 characters |
- Samples (each sample spans four consecutive lines, in the order anchor, positive, negative_1, negative_2):
The expression pattern of MT-CO1, MALAT1, EEF1A1, FTH1, TMSB4X, ACTB, FTL, RTN4, ATP6V0B, TPT1, FAU, S100A6, NDUFA4, ATP5F1E, COX7C, ITM2B, IGFBP7, EIF1, C12orf75, CD9, COX7B, SERF2, ATP1B1, COX8A, TXNIP, NDUFB2, MYL6, PPDPF, COX6B1, UQCR11, APOE, COX4I1, CALM2, UQCRB, S100A11, UQCRQ, COX6C, ATP5MG, BSG, ATP6AP2, UQCR10, PTMA, NACA, UBL5, UBA52, TMSB10, ADGRF5, HSP90AA1, GSTP1, ATP5F1D, CHCHD2, GAPDH, COX7A2, SKP1, HSPE1, PRDX1, CYSTM1, LGALS3, CD63, ATP5MJ, CKB, NDUFS5, ATP5ME, UBB, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, C11orf80 strongly indicates a kidney collecting duct intercalated cell cell.
This cell shows high expression of MALAT1, MAGI1, PLCG2, FOXP1, SPP1, ARL15, NEAT1, ZBTB20, THSD7A, IGFBP7, LPP, ERBB4, STIM2, MECOM, PSD3, RNF213, ESRRG, ADGRF5, ENTREP1, TNFRSF21, PDE4D, RTN4, ITPR2, SYNE2, TMEM117, ANK3, SNTB1, STOX2, KIF13B, S100A6, TXNIP, LIMCH1, MPPED2, ACTB, JAG1, MACF1, FMNL2, LITAF, ST6GAL1, MEGF9, SHROOM3, UBC, PICALM, FNDC3B, WAC, BBX, USP9X, FGD4, PHLDB2, ZFAND3, NDUFAF2, PCDH7, SGMS1, TRIO, COBLL1, NFAT5, GLIS3, GLS, THADA, BICC1, ZSWIM6, ADAM10, BCAS3, KANSL1L, C11orf58, C6orf62, C11orf80, C12orf75, C19orf53, C12orf57, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, suggesting it is a kidney collecting duct intercalated cell.
This cell shows high expression of MALAT1, MAGI1, PDE4D, ZBTB20, SLC8A1, ESRRG, SPP1, MECOM, SLIT2, IGFBP7, NEAT1, ERBB4, LRMDA, PDE1C, RCAN2, ARL15, PRKG1, FHIT, NEDD4L, RORA, COBLL1, PACRG, PDE1A, PLCL1, TMEM117, ATP1B1, PTPRG, EPS8, PLCG2, NFAT5, FOXP1, RTN4, IGFBP5, LPP, NR3C2, MSI2, OXR1, FTH1, SNTB1, DSCAML1, EFNA5, IMMP2L, WWOX, THSD7A, MAP4K3, NRXN3, ARID1B, SYNE2, WNK1, HIPK2, TBC1D1, MPPED2, ADGRF5, PICALM, FTL, ITGA6, DDX17, GLIS3, AMBRA1, GLS, PARD3B, BICC1, MED13L, ITPR2, C11orf80, C12orf75, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, suggesting it is a kidney collecting duct intercalated cell.
A typical remaining cell_type cell expresses the genes: MT-ND3, MALAT1, EEF1A1, CRYAB, S100A6, ITM2B, ACTB, TPT1, PTMA, FTL, PEBP1, H3-3B, GSTP1, ADIRF, IGFBP7, S100A10, HIPK2, MYL6, SERF2, TPM1, FAU, FTH1, ID4, EIF1, TMSB10, HSP90AA1, SKP1, IGFBP2, IGFBP5, PRDX1, MYL12B, CYSTM1, CLU, ATP5F1E, AHNAK, PPDPF, DSTN, ID1, COX7C, JUND, SRP14, ATP1B1, HINT1, NDUFA4, PPIA, NACA, TMA7, NEAT1, CD9, SYNE2, LAPTM4A, GNAS, CIRBP, ATP5F1D, DDX17, EDF1, CCND1, LDHB, RTN4, TMEM59, NR4A1, KTN1, SAT1, TMBIM6, C18orf32, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, C11orf80, .
neuron cells are known to express: MALAT1, KCND2, NRXN1, CDH18, NRXN3, ZNF385D, CADM2, RALYL, NKAIN2, CADPS2, RIMS1, FSTL5, GRID2, TRPM3, CHN2, DPP6, JMJD1C, RORA, PDE1A, UNC13C, TIAM1, NRG1, SNAP25, ZFPM2, CALN1, LSAMP, CNTN1, ABLIM1, SYNE1, ANK3, CA10, NFIA, ZBTB20, NTM, CADM1, OPCML, RELN, DNM3, NEBL, ERC1, SCN2A, PPP3CA, CACNA1A, GALNT13, LRRC4C, GPM6A, RABGAP1L, RIT2, CAMK4, GRIA4, PTPRD, RBFOX3, MCTP1, LHFPL6, PCLO, MEG3, PDE10A, NOVA1, RTN1, ZNF385B, CNTN4, GABRB2, SPOCK1, OXR1, C1orf21, C8orf34, C19orf53, C11orf58, C12orf57, C6orf62, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C11orf80, .
MALAT1, NRXN1, RALYL, ROBO1, GALNTL6, CADM2, LSAMP, PTPRD, CDH18, TAC1, GRID2, NRG1, NCAM2, PCDH9, CNTN4, IL1RAPL1, PCDH7, ROBO2, RORA, TENM2, LRRC4C, UNC5D, KCND2, PDE4D, PCSK1N, ZFPM2, NOVA1, MEG3, RMST, TRPM3, FRMD4A, PCDH15, DAB1, OPCML, CALY, HTR2C, ANK3, CACNA2D1, TNRC6A, AHI1, LINGO2, HS6ST3, MAP2, PPP3CA, ZFHX3, ZNF804B, RGS6, CADM1, RYR2, MEG8, LINC03051, CNTN1, SNTG1, SGCZ, SPOCK3, CALM1, CELF2, MAP1B, TMEFF2, CAMK2D, KLHL1, GRIA4, PPP2R2B, BRINP3, C8orf34, C1orf56, C4orf48, C11orf58, C6orf62, C11orf80, C18orf32, C19orf53, C12orf57, C1orf21, C9orf78, C1orf43, C7orf50, C10orf90, REL expression pattern defines this as a neuron cell.
MALAT1, IGFBP7, TMSB10, RGCC, EEF1A1, PTMA, TMSB4X, TPT1, ITM2B, ID1, VIM, IFITM3, RGS5, ACTB, TSC22D1, EIF1, FTL, H3-3B, MYL6, CALM1, GNG11, GNAS, LGALS1, ENG, CRIP2, CXCL12, SPARC, SERF2, FTH1, A2M, RACK1, CD63, SRP14, HES1, FAU, COL4A1, GAPDH, CD81, MGP, MYL12B, UBC, S100A11, UBA52, PLPP1, CALD1, APP, TAGLN2, CFL1, ANXA2, DYNLL1, TIMP3, DDX5, NACA, CD9, CAV1, PODXL, GSN, ITGB1, HMGB1, HSPA8, GNAI2, RNASE1, EMCN, GNG5, C12orf57, C11orf58, C4orf3, C1orf43, C19orf53, C6orf62, C1orf122, C7orf50, C1orf21, C9orf78, C1orf56, C10orf90 define the expression landscape of this cell.
These expression features — MALAT1, ATP10A, COBLL1, GPCPD1, PTPRG, SLC39A10, FLT1, FLI1, TSPAN5, THSD4, RUNDC3B, CCNY, IGFBP7, ST6GALNAC3, PRKCH, ST6GAL1, MECOM, ESYT2, TBC1D4, IGF1R, TACC1, HERC4, CDH2, TCF4, ABCB1, DOCK9, SORBS2, USP54, CBFA2T2, TSC22D1, QKI, EPAS1, APP, NFIB, AOPEP, ELMO1, ZNF704, PTPRM, NET1, A2M, FGD6, EPHA3, NEBL, RAPGEF2, ACVR1, SPTBN1, BBS9, KLF2, MKLN1, EXOC6, LEF1, PPP3CA, RBMS3, LRMDA, WDFY3, BCL2L1, TTC3, SIPA1L1, CFLAR, ADGRF5, MAP4K4, SCARB1, RAPGEF4, ABLIM1, C6orf62, C1orf43, C4orf3, C19orf53, C11orf58, C12orf57, C1orf21, C9orf78, C1orf56, C7orf50, C10orf90, C8orf34 — are typical of a endothelial cell identity.
Highly expressed genes: EEF1A1, ACTB, GAPDH, HMGN2, PTMA, SERF2, TMSB4X, CD74, PABPC1, FTH1, TMSB10, FAU, PFN1, HMGN1, OAZ1, HMGB1, TPT1, PPIA, NACA, BTF3, MALAT1, MYL6, ATP5MG, CFL1, RACK1, ODC1, ATP5F1E, TMA7, SLC25A5, ELOB, ARPC3, NPM1, COX7C, ANP32B, C4orf3, EIF1, PCBP2, KLF6, LAPTM5, COX8A, RHOA, HSPA8, H3-3B, PTP4A2, UBA52, OST4, CIRBP, LGALS1, EIF3L, STMN1, PPDPF, COX4I1, RAN, EIF3F, PPP1CC, COMMD6, NDUFA4, YBX1, PEBP1, COTL1, COX7A2, HSPE1, CCNI, TRIR, C11orf58, C12orf57, C9orf78, C7orf50, C11orf80, C19orf53, C6orf62, C1orf21, C1orf56, C1orf43, C10orf90, C8orf34, .
Based on transcriptomics, EEF1A1, CD74, ACTB, TPT1, GAPDH, SERF2, TMSB4X, MALAT1, OAZ1, ATP5MG, FTL, EEF2, FAU, LAPTM5, FTH1, PFN1, BTF3, EIF1, PTMA, PPIA, RACK1, TMSB10, CCNI, COX4I1, C4orf3, HMGB1, NACA, HMGN1, UBA52, PABPC1, MYL6, ATP5F1E, SEC14L1, BTG1, ATP5MC2, ARPC2, YWHAZ, CFL1, NPM1, COMMD6, PCBP2, OST4, SLC25A5, CSDE1, MEF2C, EZR, EIF3L, RBM3, CORO1A, UBE2I, METAP2, ARPC3, C12orf57, MOB4, PARK7, COX6B1, RAN, H3-3B, LCP1, SRP14, SH3BGRL3, SNRPG, EIF3H, C1orf43, C7orf50, C19orf53, C11orf58, C6orf62, C1orf21, C9orf78, C1orf56, C10orf90, C8orf34, C11orf80 are key features.
centroblast cells typically express genes such as: CD74, ACTB, EEF1A1, MALAT1, TMSB4X, GAPDH, PTMA, PFN1, SERF2, FAU, RACK1, HMGB1, PPIA, PABPC1, BTG1, EEF2, CFL1, TMSB10, ATP5MG, TPT1, HINT1, EIF1, ARPC3, BTF3, NACA, MYL6, CCNI, FTH1, MARCKSL1, NPM1, SLC25A5, CORO1A, LAPTM5, HMGN2, TSC22D3, UBE2J1, RHOA, OAZ1, ATP5MC2, ARPC2, UBA52, SERP1, EIF3H, TBCA, HSPA5, YBX1, METAP2, SNRPB, PCBP2, HERPUD1, HSP90AA1, GSTP1, DNAJC15, HNRNPA2B1, ATP5F1E, PPIB, PDIA3, GABARAP, GNG5, EIF3F, ZFP36L1, NAP1L1, TCEA1, FTL, C12orf75, C9orf78, C4orf3, C19orf53, C11orf58, C12orf57, C1orf43, C7orf50, C4orf48, C6orf62, C1orf21, C1orf56, C10orf90, C8orf34, .
High-ranking genes: CD74, MALAT1, EEF1A1, ACTB, TMSB4X, LAPTM5, PTMA, TPT1, TMSB10, CXCR4, FAU, BTG1, TXNIP, PABPC1, FTH1, NACA, FTL, IRF1, RBM3, CD83, CCNI, SARAF, BTF3, HNRNPA3, HLA-DRB5, UBA52, MEF2C, CORO1A, UBE2D3, ATP5F1E, PDIA6, UBC, GABARAP, CFL1, CALR, RACK1, HSPA5, EIF4B, RHOA, HNRNPC, SRSF5, PFN1, HSPA8, CNOT2, IFT57, HNRNPA2B1, COX7C, ITM2B, SH3BGRL3, PNRC1, PDIA3, EEF2, UBB, PARP14, SNX2, LAP3, SLC25A5, POU2F2, ADAM28, ZNF800, CYBA, GDI2, STK17B, EIF3I, C9orf78, C4orf3, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf56, C1orf43, C7orf50, C10orf90, C8orf34, .
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 256
- per_device_eval_batch_size: 256
- learning_rate: 2e-05
- num_train_epochs: 4
- warmup_ratio: 0.1
- bf16: True
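These settings fix the length of the run and the shape of the learning-rate curve. The sketch below works through the arithmetic in plain Python. It assumes a single device, no gradient accumulation, a linear scheduler, and warmup steps rounded up from `warmup_ratio`; the constant and function names are hypothetical. The resulting 317 steps per epoch agrees with the training logs, where step 50 falls at epoch 0.1577.

```python
import math

TRAIN_SAMPLES = 81_143   # size of the training dataset
BATCH_SIZE = 256         # per_device_train_batch_size, one device assumed
EPOCHS = 4               # num_train_epochs
BASE_LR = 2e-05          # learning_rate
WARMUP_RATIO = 0.1       # warmup_ratio

# The last, partially filled batch of each epoch is kept, so round up.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def linear_schedule_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - warmup_steps)

print(steps_per_epoch, total_steps, warmup_steps)
print(linear_schedule_lr(warmup_steps), linear_schedule_lr(total_steps))
```

Under these assumptions the learning rate climbs to its peak of 2e-05 over roughly the first tenth of training and then decays linearly to zero by the final step.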
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 256
- per_device_eval_batch_size: 256
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Training Logs
Epoch | Step | Training Loss | cellxgene pseudo bulk 100k multiplets natural language annotation loss | cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_3_cell_sentence_4_cosine_accuracy |
---|---|---|---|---|
0.1577 | 50 | 4.9245 | - | - |
0.3155 | 100 | 4.1332 | 4.1477 | 0.8061 |
0.4732 | 150 | 3.835 | - | - |
0.6309 | 200 | 3.6548 | 3.9144 | 0.8210 |
0.7886 | 250 | 3.64 | - | - |
0.9464 | 300 | 3.5722 | 3.7670 | 0.8263 |
1.1041 | 350 | 3.4727 | - | - |
1.2618 | 400 | 3.4587 | 3.6948 | 0.8280 |
1.4196 | 450 | 3.4024 | - | - |
1.5773 | 500 | 3.3818 | 3.6560 | 0.8324 |
1.7350 | 550 | 3.3937 | - | - |
1.8927 | 600 | 3.3877 | 3.6470 | 0.8304 |
2.0505 | 650 | 3.308 | - | - |
2.2082 | 700 | 3.3111 | 3.6163 | 0.8335 |
2.3659 | 750 | 3.3042 | - | - |
2.5237 | 800 | 3.2928 | 3.5791 | 0.8345 |
2.6814 | 850 | 3.276 | - | - |
2.8391 | 900 | 3.2742 | 3.5903 | 0.8350 |
2.9968 | 950 | 3.2729 | - | - |
3.1546 | 1000 | 3.2397 | 3.5732 | 0.8360 |
3.3123 | 1050 | 3.2306 | - | - |
3.4700 | 1100 | 3.2237 | 3.5873 | 0.8349 |
3.6278 | 1150 | 3.2451 | - | - |
3.7855 | 1200 | 3.2429 | 3.5632 | 0.8359 |
3.9432 | 1250 | 3.2153 | - | - |
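The rows that report evaluation values can be compared programmatically to see where the run peaked. The sketch below copies those rows from the table above and selects the step with the lowest loss in the fourth column (read here as the evaluation-set loss, which is an assumption) and the step with the highest cosine accuracy. The variable names are hypothetical.

```python
# (step, evaluation loss, cosine accuracy) for every row that reports both.
eval_rows = [
    (100, 4.1477, 0.8061),
    (200, 3.9144, 0.8210),
    (300, 3.7670, 0.8263),
    (400, 3.6948, 0.8280),
    (500, 3.6560, 0.8324),
    (600, 3.6470, 0.8304),
    (700, 3.6163, 0.8335),
    (800, 3.5791, 0.8345),
    (900, 3.5903, 0.8350),
    (1000, 3.5732, 0.8360),
    (1100, 3.5873, 0.8349),
    (1200, 3.5632, 0.8359),
]

lowest_loss = min(eval_rows, key=lambda row: row[1])
highest_accuracy = max(eval_rows, key=lambda row: row[2])
print("lowest evaluation loss:", lowest_loss)
print("highest cosine accuracy:", highest_accuracy)
```

The two criteria select different steps, but their accuracies differ by only 0.0001, and the accuracy at the last evaluated step (0.8359) matches the value reported in the Evaluation section.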
Framework Versions
- Python: 3.11.6
- Sentence Transformers: 5.0.0
- Transformers: 4.55.0.dev0
- PyTorch: 2.5.1+cu121
- Accelerate: 1.9.0
- Datasets: 2.19.1
- Tokenizers: 0.21.4
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Model tree for jo-mengr/mmcontext-pubmedbert-semantic_100k
- Finetuned from: NeuML/pubmedbert-base-embeddings
Evaluation results
- Cosine Accuracy on cellxgene pseudo bulk 100k multiplets natural language annotation cell sentence 3 cell sentence 4: 0.836 (self-reported)