SentenceTransformer based on NeuML/pubmedbert-base-embeddings

This is a sentence-transformers model finetuned from NeuML/pubmedbert-base-embeddings on the cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (pooling): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
)
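
The Pooling module above has pooling_mode_mean_tokens set to True, so the sentence embedding is the average of the token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through the attention mask. The helper name mean_pool and the toy 3-dimensional vectors are illustrative only; the real model works on 768-dimensional tensors, and this is not the library's own implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    n_real = max(n_real, 1)
    return [s / n_real for s in summed]


# Toy example: two real tokens followed by one padding token.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0, 4.0]
```

The padding vector does not influence the result, which is why inputs of different lengths in one batch still produce comparable embeddings.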

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-pubmedbert-semantic_100k")
# Run inference
sentences = [
    'Transcriptomic features: MALAT1, TMSB4X, EEF1A1, CD74, FTL, TPT1, FTH1, PTMA, NACA, TMSB10, ACTB, FOS, ATP5F1E, H3-3B, RNASET2, CYBA, JUNB, S100A6, UBA52, KLF6, ID2, TSC22D3, COX6C, PPIA, S100A10, SELENOH, VIM, PHPT1, BUD23, MYL12A, OAZ1, SPCS1, NDUFS7, DUSP1, SRP14, EMP3, PARP1, AFF3, FAU, UBC, TAGLN2, ARPC2, NAP1L1, TOMM7, CALM1, AP2S1, PKM, RHOA, HSP90AA1, COBLL1, HNRNPC, MYL6, ST13, RBX1, CTSZ, CST3, PSMD7, C19orf53, CHCHD2, SEC61B, WSB1, MS4A6A, ARPC3, GAPDH, C12orf75, C12orf57, C6orf62, C1orf56, C7orf50, C4orf3, C11orf58, C1orf21, C9orf78, C1orf43, C10orf90, C8orf34, .',
    'Gene expression matches that of plasmacytoid dendritic cell cells, including: MALAT1, TMSB4X, CD74, EEF1A1, FTH1, TPT1, TMSB10, PTMA, UBA52, ACTB, FTL, FAU, ATP5F1E, CYBA, NPC2, SRP14, SERF2, CALM2, NACA, RACK1, CST3, DDX5, SRGN, PPIA, TCF4, COX4I1, KLF6, ERP29, SAMHD1, OAZ1, PFN1, VAMP8, COX7C, H3-3B, CD164, IFI44L, IRF8, CCDC50, ZFP36L2, ATP5MG, EIF1, TYROBP, VIM, RNASET2, YBX1, PABPC1, HNRNPC, MYL6, MYL12A, EDF1, PTGDS, MYL12B, DUSP1, CXCR4, LRRFIP1, PTPRE, SP110, ITM2C, LCP1, TXN, APP, PNRC1, CCDC186, C12orf75, C11orf58, C19orf53, C12orf57, C4orf3, C1orf122, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C8orf34, .',
    'The gene markers EEF1A1, ACTB, MALAT1, TMSB4X, H3-3B, TMSB10, ZFP36, DNAJB1, NFKBIA, PFN1, HSP90AA1, TPT1, PTMA, FAU, DUSP2, EIF1, BTG1, IFITM2, HSPA8, GAPDH, FTH1, NACA, RACK1, TYROBP, FTL, HSPE1, SRGN, SERF2, JUNB, BTG2, FOS, CFL1, PPIA, CYBA, PABPC1, PPP1R15A, MYL6, HSP90AB1, GADD45B, MYL12A, ATP5F1E, SH3BGRL3, IER2, JUN, CORO1A, BTF3, PNRC1, UBC, NR4A2, UBB, HOPX, CMC1, PCBP2, CALM1, RHOA, DNAJA1, OAZ1, SUB1, PPDPF, COX7C, COX4I1, ATP5MC2, IFITM3, ARPC2, C12orf57, C4orf3, C12orf75, C19orf53, C11orf58, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C8orf34, C11orf80 suggest this cell is a natural killer cell.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6802, 0.4491],
#         [0.6802, 1.0000, 0.4856],
#         [0.4491, 0.4856, 1.0000]])
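
The similarity scores above look like cosine similarities (the diagonal is 1.0000). As a hedged illustration of how such a matrix can be used, the sketch below computes cosine similarity by hand and picks, for each sentence, the most similar other sentence. The helper names cosine_similarity and nearest_neighbours are made up for this example; the matrix values are the ones printed above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(similarity_matrix):
    """For each row, return the index of the most similar *other* row."""
    result = []
    for i, row in enumerate(similarity_matrix):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result


# Similarity matrix copied from the example output above.
reported = [
    [1.0000, 0.6802, 0.4491],
    [0.6802, 1.0000, 0.4856],
    [0.4491, 0.4856, 1.0000],
]
print(nearest_neighbours(reported))  # [1, 0, 1]
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

In this example the first two sentences are each other's nearest neighbour, and the natural killer cell sentence is closest to the plasmacytoid dendritic cell sentence.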

Evaluation

Metrics

Triplet

  • Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_3_cell_sentence_4
  • Evaluated with TripletEvaluator
    Metric           Value
    cosine_accuracy  0.8359
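
As a rough sketch of what a triplet cosine accuracy measures: a triplet counts as correct when the anchor is more similar to the positive than to the negative, and the accuracy is the fraction of correct triplets. The code below assumes a margin of zero and uses made-up 2-dimensional embeddings; the helper names are illustrative and this is not the TripletEvaluator source.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = 0
    for anchor, positive, negative in triplets:
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative):
            correct += 1
    return correct / len(triplets)


# Made-up 2-d embeddings, for illustration only.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),   # negative is closer: wrong
    ([1.0, 1.0], [2.0, 2.0], [1.0, -1.0]),  # positive is closer: correct
    ([1.0, 0.0], [1.0, 0.2], [1.0, 0.5]),   # positive is closer: correct
]
print(cosine_accuracy(toy_triplets))  # 0.75
```

Read this way, the reported value of 0.8359 means that roughly 84% of the evaluation triplets were ranked correctly.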

Training Details

Training Dataset

cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation

  • Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation at b141493
  • Size: 81,143 training samples
  • Columns: anchor, positive, negative_1, and negative_2
  • Approximate statistics based on the first 1000 samples:
    column      type    min (characters)  mean (characters)  max (characters)
    anchor      string  619               677.78             770
    positive    string  616               677.49             770
    negative_1  string  614               677.93             760
    negative_2  string  613               678.84             758
  • Samples:
    anchor positive negative_1 negative_2
    This cell shows significant expression of: TMSB4X, TMSB10, ACTB, MALAT1, GNLY, NKG7, IFITM2, LGALS1, GZMA, EEF1A1, PFN1, HMGB2, FTH1, PTMA, HSP90AA1, GZMB, ARHGDIB, HNRNPA2B1, PLAAT4, FAU, CMC1, VIM, MYL12A, CBX3, ATP5F1E, HCST, IFI44L, KLRF1, H3-3A, COX6C, ARL6IP1, CFL1, ISG15, HMGB1, S100A4, ATP5MF, RORA, MYL6, CORO1A, OAZ1, KLRB1, ID2, HMGN3, CCNI, RBM39, CAP1, SERF2, ELOC, FCER1G, S100A9, IFI16, YWHAZ, EIF1, CALR, HMGN2, SKAP2, SLC25A5, ZZZ3, YBX1, NUCB2, CDC42, GSTP1, FTL, ATP5F1D, C19orf53, C11orf58, C12orf57, C9orf78, C1orf162, C1orf122, C6orf62, C1orf21, C1orf54, C1orf198, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34, . lymphocyte cells are known to express: MT-CO2, TMSB4X, MALAT1, TMSB10, EEF1A1, ACTB, PTMA, PFN1, GAPDH, HMGB2, HMGB1, TMA7, GNLY, TUBA1B, TPT1, FAU, YBX1, ATP5F1E, CD52, GSTP1, GZMB, CORO1A, CALM1, HMGN2, RACK1, MYL6, BLOC1S1, S100A6, VIM, COTL1, OAZ1, HNRNPA2B1, DEK, ETS1, SERF2, SRP14, NDUFS6, GZMA, H2AZ1, EEF2, HINT1, UQCRH, SRSF10, UBA52, CD74, ENO1, HSP90AA1, HSP90AB1, ARHGDIB, COX7C, ANXA1, TXN, SNRPG, MSN, UBB, COX8A, POLR2L, UBL5, PKM, FTL, LGALS1, RBM3, EIF3E, CHCHD2, C12orf57, C19orf53, C11orf58, C7orf50, C6orf62, C9orf78, C4orf3, C12orf75, C1orf21, C1orf54, C1orf198, C1orf162, C1orf56, C1orf43, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34, . 
MALAT1, EEF1A1, TMSB4X, FTL, ACTB, DNAJB1, H3-3B, CD74, HSP90AA1, DUSP1, IL32, TMSB10, HSP90AB1, CD69, TPT1, BTG1, UBB, RGS1, PFN1, UBC, HSPB1, FAU, EIF1, GAPDH, SAT1, FTH1, HSPA8, HSPE1, SARAF, SERF2, TSC22D3, FOS, PTMA, NACA, CD3E, VIM, DNAJA1, ARHGDIB, CD2, CXCR4, ATP5F1E, SH3BGRL3, HSPA6, RACK1, UBA52, HERPUD1, KLF6, ITM2A, FXYD5, MYL6, CD37, OAZ1, NKG7, C12orf57, CYTIP, SRSF7, CACYBP, RGS2, TNFAIP3, SERP1, PPDPF, RAC2, COX4I1, SRRM1, C1orf43, C19orf53, C11orf58, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34, ATR are commonly found in CD8-positive, alpha-beta T cell cells. These expression features — MALAT1, TMSB4X, EEF1A1, CD74, BTG1, PTMA, TMSB10, TPT1, FAU, EIF1, FTH1, FTL, CXCR4, TSC22D3, DUSP1, UBA52, ACTB, CD37, CD52, NACA, RACK1, EZR, CD69, LAPTM5, H3-3A, FOS, ISG20, YBX1, CIRBP, EIF3E, OAZ1, COX7C, SAT1, COX4I1, H3-3B, SH3BGRL3, UBC, UBB, JUNB, COMMD6, VIM, CYBA, KLF6, STK17B, FUS, HNRNPC, MYL6, GADD45B, LGALS1, EIF3L, SRSF5, NFKBIA, ANKRD12, CORO1A, TLE5, NOP53, CHCHD2, PFN1, DDX5, ARPC3, COX7A2, YPEL5, ARL4A, SRGN, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34 — are typical of a B cell identity.
    A typical effector memory CD4-positive, alpha-beta T cell cell expresses the genes: EEF1A1, MALAT1, FTH1, JUNB, TPT1, FOS, TMSB10, BTG1, TMSB4X, ZFP36L2, NACA, PABPC1, ACTB, FAU, VIM, H3-3B, EIF1, ZFP36, SARAF, PTMA, IL7R, JUN, RACK1, EEF2, UBA52, GAPDH, FTL, FXYD5, DUSP1, S100A4, CD69, CXCR4, UBC, TSC22D3, CFL1, KLF6, ARHGDIB, KLF2, BTG2, CITED2, IER2, TUBB4B, CD3E, EEF1G, SLC2A3, NFKBIA, PFN1, SRGN, SNX9, COX4I1, DNAJB1, SERF2, CD8A, PCBP2, IL32, BIRC3, SMAP2, FUS, GADD45B, MYL12A, OAZ1, ATP5F1E, TUBA4A, C19orf53, C12orf57, C4orf3, C9orf78, C1orf162, C12orf75, C11orf58, C6orf62, C1orf21, C1orf54, C1orf198, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34, . MALAT1, EEF1A1, TPT1, TMSB4X, ACTB, TMSB10, FAU, JUNB, RACK1, FTH1, PTMA, IL32, VIM, ZFP36L2, IL7R, S100A4, NACA, FTL, PFN1, CD52, EIF1, UBA52, EEF1G, PABPC1, SARAF, GAPDH, SH3BGRL3, EEF2, H3-3B, BTG1, TXNIP, FXYD5, MYL12A, SERF2, CFL1, CALM1, ARHGDIB, LDHB, ATP5F1E, CD3E, SLC2A3, NFKBIA, CORO1A, DDX5, HSPA8, C12orf57, COX7C, COX4I1, ITM2B, UBC, HINT1, TOMM7, PCBP2, S100A6, HSP90AA1, MYL6, HSP90AB1, NOP53, CD69, CXCR4, HNRNPA2B1, PPDPF, RAC2, PNRC1, C19orf53, C11orf58, C4orf3, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34 expression pattern defines this as a effector memory CD4-positive, alpha-beta T cell cell. 
Detected gene expression: TMSB4X, ACTB, TMSB10, FTH1, FTL, EEF1A1, TPT1, UBA52, CD74, PPA1, GAPDH, TYROBP, LGALS1, PTMA, PFN1, IFI30, NACA, CD52, EIF1, EEF1G, CFL1, GSTP1, LYZ, MYL6, COX7C, TXN, SERF2, DBI, ARPC2, EEF2, CD44, RGS1, UQCR11, H3-3A, S100A11, RACK1, CYBA, YBX1, NDUFB2, CHCHD2, TPI1, NPC2, TUBA1B, COX4I1, GSN, UCP2, OST4, MARCKS, TYMP, PABPC1, ENO1, FSCN1, HSP90AA1, FKBP1A, TMEM230, RANBP1, COTL1, EIF3E, NOP53, HSPA8, COX7A2, SUB1, GBP1, TRIR, C4orf3, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C8orf34, . CD74, MALAT1, EEF1A1, FOS, TPT1, TMSB4X, TMSB10, ACTB, FAU, JUN, CD37, DUSP1, RACK1, JUNB, EIF1, PTMA, FTL, DNAJB1, H3-3B, CD52, NACA, BTG1, TSC22D3, FTH1, PABPC1, EEF2, UBA52, EEF1G, HSP90AA1, LAPTM5, CYBA, PPP1R15A, HSP90AB1, CD69, ARHGDIB, ZFP36, SERF2, UBC, H3-3A, PCBP2, HLA-DRB5, KLF6, PFN1, DDX5, HSPA8, ARPC3, CD83, CCNI, CXCR4, ATP5F1E, SARAF, TUBA1A, ZFP36L1, TOMM7, HERPUD1, YBX1, RHOA, MEF2C, FXYD5, MYL6, SRSF5, MYL12A, CORO1A, OAZ1, C12orf57, C19orf53, C11orf58, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34 are the top expressed genes in this cell.
    The expression of MALAT1, GRIK1, SYT1, PCDH9, RORA, NRG1, CADPS, ZFPM2, LRRC4C, LINGO2, RALYL, PTPRD, SPHKAP, CNTNAP5, SLC8A1, CCSER1, HDAC9, CELF2, R3HDM1, CNTN4, RBMS3, PCDH7, GALNT13, UNC5D, ROBO1, SYNPR, SNAP25, GPM6A, ANK3, FRMPD4, CHRM2, RYR2, KHDRBS2, CADM1, CACNA1D, RGS6, PDE4D, DOCK4, UNC13C, CDH18, FAT3, MEG3, NR2F2-AS1, HMCN1, GULP1, CAMK2D, ZEB1, SYN2, DYNC1I1, OXR1, DPP10, OSBPL6, FRAS1, PPP3CA, ZNF385D, ZMAT4, PCBP3, HS6ST3, ERC2, PLEKHA5, CDK14, MAP2, NCOA1, ATP8A2, C1orf21, C19orf53, C11orf58, C12orf57, C6orf62, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34 aligns with a neuron identity. This cell expresses the genes: MALAT1, CDH18, RALYL, ZNF385D, CADPS2, SYT1, RIMS1, TRPM3, JMJD1C, RORA, PDE1A, CA10, FSTL5, CHN2, UNC13C, PPP3CA, GALNT13, TIAM1, ZBTB20, RELN, SNAP25, CNTN4, ANK3, RABGAP1L, RIT2, PTPRD, NRG1, NFIA, KCNJ3, ZFPM2, MCTP1, CADM1, CALN1, ZNF521, NEBL, RUNX1T1, ERC1, ABLIM1, SYNE1, NOVA1, CACNA1A, ZNF385B, GABRB2, CNKSR2, GPM6A, MAGI1, SPOCK1, CAMK4, GRM1, SYNPR, OXR1, UBE2E2, LHFPL6, KCTD8, CCSER1, KCNH7, PCLO, TCF4, RYR2, PRANCR, ETV1, TENM1, CELF2, ARID1B, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf56, C1orf43, C7orf50, C9orf72, C4orf19, C10orf90, C21orf91, C4orf3, C8orf34, . 
neuron cells typically express genes such as: MALAT1, NPY, GRIK1, SST, ROBO1, PCDH9, IL1RAPL2, SYT1, CCSER1, MEG3, CADPS, PTPRD, GPC6, LARGE1, PDE4D, GRIP1, KIAA1217, MEG8, EPHA6, NXPH1, PDE4B, PCDH7, DCC, GRIA3, UNC5D, NRG1, SOX6, ESRRG, PDE1A, CACNA2D3, OXR1, RAPGEF4, LINGO2, ANK3, RBMS3, RIMS1, LRRC4C, XKR4, PIP5K1B, PAM, TENM2, TCF4, RASGRF2, CHRM3, GULP1, CDH4, ZNF385D, DGKI, THSD7A, MAGI1, DAB1, PTPRM, QKI, FRMD4A, LRFN5, NELL1, MAML2, LHFPL3, CDH8, UTRN, SNAP25, ATP1B1, CAMK4, DPP10, C8orf34, C11orf58, C1orf56, C1orf43, C1orf21, C6orf62, C1orf122, C19orf53, C1orf198, C9orf78, C1orf162, C7orf50, C4orf19, C21orf91, C4orf48, C12orf57, C1orf54, . MALAT1, PCDH9, PTPRD, NRG1, SYT1, DPP10, ROBO1, TENM2, LRRC4C, RBMS3, CNTNAP5, LINGO2, CDH18, SLC8A1, DMD, PDE4D, RYR2, ATP1B1, RGS6, PTPRT, CHRM3, ADGRL2, NOVA1, NTNG1, PCDH7, TAFA2, CCSER1, ANK3, MEG3, MAP2, PLCB4, CACNA2D1, PRKG1, LINC03000, RMST, RORA, FOXP2, LHFPL3, MEG8, TNRC6A, DAB1, KCTD8, RALYL, GNAS, INPP4B, OLFM3, CNTN4, FRMD4A, LINC00632, GAPDH, ENOX1, AHI1, GPM6A, EBF1, LRFN5, PCSK1N, SEMA5A, KIAA1217, CALY, MAP1B, SNAP25, GABRB2, CDH8, GRIP1, C8orf34, C4orf48, C19orf53, C11orf58, C1orf56, C9orf72, C1orf122, C12orf57, C6orf62, C1orf21, C1orf54, C1orf198, C9orf78, C1orf162, C1orf43, C7orf50, C4orf19, C10orf90, C21orf91, C4orf3 define the expression landscape of this cell.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
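
To make the two parameters concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking objective: cosine similarities between each anchor and every candidate in the batch are multiplied by scale (20.0 here) and fed to a cross-entropy loss whose correct class is the anchor's own positive. The function names and the 2-dimensional embeddings are invented for illustration; the library's implementation operates on batched tensors.

```python
import math


def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the correct match for anchors[i].

    Every other candidate in the batch acts as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


# Made-up embeddings: two anchors, their two positives, then one extra negative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
well_separated = mnr_loss(anchors, candidates)
shuffled = mnr_loss(anchors, [candidates[1], candidates[0], candidates[2]])
print(well_separated < shuffled)  # True
```

The loss is small when each anchor is closest to its own positive and grows when the pairing is wrong; a larger scale sharpens that contrast.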
    

Evaluation Dataset

cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation

  • Dataset: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation at b141493
  • Size: 9,011 evaluation samples
  • Columns: anchor, positive, negative_1, and negative_2
  • Approximate statistics based on the first 1000 samples:
    column      type    min (characters)  mean (characters)  max (characters)
    anchor      string  563               626.08             722
    positive    string  563               626.27             722
    negative_1  string  558               626.66             714
    negative_2  string  561               628.47             713
  • Samples:
    anchor positive negative_1 negative_2
    The expression pattern of MT-CO1, MALAT1, EEF1A1, FTH1, TMSB4X, ACTB, FTL, RTN4, ATP6V0B, TPT1, FAU, S100A6, NDUFA4, ATP5F1E, COX7C, ITM2B, IGFBP7, EIF1, C12orf75, CD9, COX7B, SERF2, ATP1B1, COX8A, TXNIP, NDUFB2, MYL6, PPDPF, COX6B1, UQCR11, APOE, COX4I1, CALM2, UQCRB, S100A11, UQCRQ, COX6C, ATP5MG, BSG, ATP6AP2, UQCR10, PTMA, NACA, UBL5, UBA52, TMSB10, ADGRF5, HSP90AA1, GSTP1, ATP5F1D, CHCHD2, GAPDH, COX7A2, SKP1, HSPE1, PRDX1, CYSTM1, LGALS3, CD63, ATP5MJ, CKB, NDUFS5, ATP5ME, UBB, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, C11orf80 strongly indicates a kidney collecting duct intercalated cell cell. This cell shows high expression of MALAT1, MAGI1, PLCG2, FOXP1, SPP1, ARL15, NEAT1, ZBTB20, THSD7A, IGFBP7, LPP, ERBB4, STIM2, MECOM, PSD3, RNF213, ESRRG, ADGRF5, ENTREP1, TNFRSF21, PDE4D, RTN4, ITPR2, SYNE2, TMEM117, ANK3, SNTB1, STOX2, KIF13B, S100A6, TXNIP, LIMCH1, MPPED2, ACTB, JAG1, MACF1, FMNL2, LITAF, ST6GAL1, MEGF9, SHROOM3, UBC, PICALM, FNDC3B, WAC, BBX, USP9X, FGD4, PHLDB2, ZFAND3, NDUFAF2, PCDH7, SGMS1, TRIO, COBLL1, NFAT5, GLIS3, GLS, THADA, BICC1, ZSWIM6, ADAM10, BCAS3, KANSL1L, C11orf58, C6orf62, C11orf80, C12orf75, C19orf53, C12orf57, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, suggesting it is a kidney collecting duct intercalated cell. 
This cell shows high expression of MALAT1, MAGI1, PDE4D, ZBTB20, SLC8A1, ESRRG, SPP1, MECOM, SLIT2, IGFBP7, NEAT1, ERBB4, LRMDA, PDE1C, RCAN2, ARL15, PRKG1, FHIT, NEDD4L, RORA, COBLL1, PACRG, PDE1A, PLCL1, TMEM117, ATP1B1, PTPRG, EPS8, PLCG2, NFAT5, FOXP1, RTN4, IGFBP5, LPP, NR3C2, MSI2, OXR1, FTH1, SNTB1, DSCAML1, EFNA5, IMMP2L, WWOX, THSD7A, MAP4K3, NRXN3, ARID1B, SYNE2, WNK1, HIPK2, TBC1D1, MPPED2, ADGRF5, PICALM, FTL, ITGA6, DDX17, GLIS3, AMBRA1, GLS, PARD3B, BICC1, MED13L, ITPR2, C11orf80, C12orf75, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, suggesting it is a kidney collecting duct intercalated cell. A typical remaining cell_type cell expresses the genes: MT-ND3, MALAT1, EEF1A1, CRYAB, S100A6, ITM2B, ACTB, TPT1, PTMA, FTL, PEBP1, H3-3B, GSTP1, ADIRF, IGFBP7, S100A10, HIPK2, MYL6, SERF2, TPM1, FAU, FTH1, ID4, EIF1, TMSB10, HSP90AA1, SKP1, IGFBP2, IGFBP5, PRDX1, MYL12B, CYSTM1, CLU, ATP5F1E, AHNAK, PPDPF, DSTN, ID1, COX7C, JUND, SRP14, ATP1B1, HINT1, NDUFA4, PPIA, NACA, TMA7, NEAT1, CD9, SYNE2, LAPTM4A, GNAS, CIRBP, ATP5F1D, DDX17, EDF1, CCND1, LDHB, RTN4, TMEM59, NR4A1, KTN1, SAT1, TMBIM6, C18orf32, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C8orf34, C11orf80, .
    neuron cells are known to express: MALAT1, KCND2, NRXN1, CDH18, NRXN3, ZNF385D, CADM2, RALYL, NKAIN2, CADPS2, RIMS1, FSTL5, GRID2, TRPM3, CHN2, DPP6, JMJD1C, RORA, PDE1A, UNC13C, TIAM1, NRG1, SNAP25, ZFPM2, CALN1, LSAMP, CNTN1, ABLIM1, SYNE1, ANK3, CA10, NFIA, ZBTB20, NTM, CADM1, OPCML, RELN, DNM3, NEBL, ERC1, SCN2A, PPP3CA, CACNA1A, GALNT13, LRRC4C, GPM6A, RABGAP1L, RIT2, CAMK4, GRIA4, PTPRD, RBFOX3, MCTP1, LHFPL6, PCLO, MEG3, PDE10A, NOVA1, RTN1, ZNF385B, CNTN4, GABRB2, SPOCK1, OXR1, C1orf21, C8orf34, C19orf53, C11orf58, C12orf57, C6orf62, C9orf78, C1orf56, C1orf43, C7orf50, C10orf90, C4orf3, C11orf80, . MALAT1, NRXN1, RALYL, ROBO1, GALNTL6, CADM2, LSAMP, PTPRD, CDH18, TAC1, GRID2, NRG1, NCAM2, PCDH9, CNTN4, IL1RAPL1, PCDH7, ROBO2, RORA, TENM2, LRRC4C, UNC5D, KCND2, PDE4D, PCSK1N, ZFPM2, NOVA1, MEG3, RMST, TRPM3, FRMD4A, PCDH15, DAB1, OPCML, CALY, HTR2C, ANK3, CACNA2D1, TNRC6A, AHI1, LINGO2, HS6ST3, MAP2, PPP3CA, ZFHX3, ZNF804B, RGS6, CADM1, RYR2, MEG8, LINC03051, CNTN1, SNTG1, SGCZ, SPOCK3, CALM1, CELF2, MAP1B, TMEFF2, CAMK2D, KLHL1, GRIA4, PPP2R2B, BRINP3, C8orf34, C1orf56, C4orf48, C11orf58, C6orf62, C11orf80, C18orf32, C19orf53, C12orf57, C1orf21, C9orf78, C1orf43, C7orf50, C10orf90, REL expression pattern defines this as a neuron cell. MALAT1, IGFBP7, TMSB10, RGCC, EEF1A1, PTMA, TMSB4X, TPT1, ITM2B, ID1, VIM, IFITM3, RGS5, ACTB, TSC22D1, EIF1, FTL, H3-3B, MYL6, CALM1, GNG11, GNAS, LGALS1, ENG, CRIP2, CXCL12, SPARC, SERF2, FTH1, A2M, RACK1, CD63, SRP14, HES1, FAU, COL4A1, GAPDH, CD81, MGP, MYL12B, UBC, S100A11, UBA52, PLPP1, CALD1, APP, TAGLN2, CFL1, ANXA2, DYNLL1, TIMP3, DDX5, NACA, CD9, CAV1, PODXL, GSN, ITGB1, HMGB1, HSPA8, GNAI2, RNASE1, EMCN, GNG5, C12orf57, C11orf58, C4orf3, C1orf43, C19orf53, C6orf62, C1orf122, C7orf50, C1orf21, C9orf78, C1orf56, C10orf90 define the expression landscape of this cell. 
These expression features — MALAT1, ATP10A, COBLL1, GPCPD1, PTPRG, SLC39A10, FLT1, FLI1, TSPAN5, THSD4, RUNDC3B, CCNY, IGFBP7, ST6GALNAC3, PRKCH, ST6GAL1, MECOM, ESYT2, TBC1D4, IGF1R, TACC1, HERC4, CDH2, TCF4, ABCB1, DOCK9, SORBS2, USP54, CBFA2T2, TSC22D1, QKI, EPAS1, APP, NFIB, AOPEP, ELMO1, ZNF704, PTPRM, NET1, A2M, FGD6, EPHA3, NEBL, RAPGEF2, ACVR1, SPTBN1, BBS9, KLF2, MKLN1, EXOC6, LEF1, PPP3CA, RBMS3, LRMDA, WDFY3, BCL2L1, TTC3, SIPA1L1, CFLAR, ADGRF5, MAP4K4, SCARB1, RAPGEF4, ABLIM1, C6orf62, C1orf43, C4orf3, C19orf53, C11orf58, C12orf57, C1orf21, C9orf78, C1orf56, C7orf50, C10orf90, C8orf34 — are typical of a endothelial cell identity.
    Highly expressed genes: EEF1A1, ACTB, GAPDH, HMGN2, PTMA, SERF2, TMSB4X, CD74, PABPC1, FTH1, TMSB10, FAU, PFN1, HMGN1, OAZ1, HMGB1, TPT1, PPIA, NACA, BTF3, MALAT1, MYL6, ATP5MG, CFL1, RACK1, ODC1, ATP5F1E, TMA7, SLC25A5, ELOB, ARPC3, NPM1, COX7C, ANP32B, C4orf3, EIF1, PCBP2, KLF6, LAPTM5, COX8A, RHOA, HSPA8, H3-3B, PTP4A2, UBA52, OST4, CIRBP, LGALS1, EIF3L, STMN1, PPDPF, COX4I1, RAN, EIF3F, PPP1CC, COMMD6, NDUFA4, YBX1, PEBP1, COTL1, COX7A2, HSPE1, CCNI, TRIR, C11orf58, C12orf57, C9orf78, C7orf50, C11orf80, C19orf53, C6orf62, C1orf21, C1orf56, C1orf43, C10orf90, C8orf34, . Based on transcriptomics, EEF1A1, CD74, ACTB, TPT1, GAPDH, SERF2, TMSB4X, MALAT1, OAZ1, ATP5MG, FTL, EEF2, FAU, LAPTM5, FTH1, PFN1, BTF3, EIF1, PTMA, PPIA, RACK1, TMSB10, CCNI, COX4I1, C4orf3, HMGB1, NACA, HMGN1, UBA52, PABPC1, MYL6, ATP5F1E, SEC14L1, BTG1, ATP5MC2, ARPC2, YWHAZ, CFL1, NPM1, COMMD6, PCBP2, OST4, SLC25A5, CSDE1, MEF2C, EZR, EIF3L, RBM3, CORO1A, UBE2I, METAP2, ARPC3, C12orf57, MOB4, PARK7, COX6B1, RAN, H3-3B, LCP1, SRP14, SH3BGRL3, SNRPG, EIF3H, C1orf43, C7orf50, C19orf53, C11orf58, C6orf62, C1orf21, C9orf78, C1orf56, C10orf90, C8orf34, C11orf80 are key features. centroblast cells typically express genes such as: CD74, ACTB, EEF1A1, MALAT1, TMSB4X, GAPDH, PTMA, PFN1, SERF2, FAU, RACK1, HMGB1, PPIA, PABPC1, BTG1, EEF2, CFL1, TMSB10, ATP5MG, TPT1, HINT1, EIF1, ARPC3, BTF3, NACA, MYL6, CCNI, FTH1, MARCKSL1, NPM1, SLC25A5, CORO1A, LAPTM5, HMGN2, TSC22D3, UBE2J1, RHOA, OAZ1, ATP5MC2, ARPC2, UBA52, SERP1, EIF3H, TBCA, HSPA5, YBX1, METAP2, SNRPB, PCBP2, HERPUD1, HSP90AA1, GSTP1, DNAJC15, HNRNPA2B1, ATP5F1E, PPIB, PDIA3, GABARAP, GNG5, EIF3F, ZFP36L1, NAP1L1, TCEA1, FTL, C12orf75, C9orf78, C4orf3, C19orf53, C11orf58, C12orf57, C1orf43, C7orf50, C4orf48, C6orf62, C1orf21, C1orf56, C10orf90, C8orf34, . 
High-ranking genes: CD74, MALAT1, EEF1A1, ACTB, TMSB4X, LAPTM5, PTMA, TPT1, TMSB10, CXCR4, FAU, BTG1, TXNIP, PABPC1, FTH1, NACA, FTL, IRF1, RBM3, CD83, CCNI, SARAF, BTF3, HNRNPA3, HLA-DRB5, UBA52, MEF2C, CORO1A, UBE2D3, ATP5F1E, PDIA6, UBC, GABARAP, CFL1, CALR, RACK1, HSPA5, EIF4B, RHOA, HNRNPC, SRSF5, PFN1, HSPA8, CNOT2, IFT57, HNRNPA2B1, COX7C, ITM2B, SH3BGRL3, PNRC1, PDIA3, EEF2, UBB, PARP14, SNX2, LAP3, SLC25A5, POU2F2, ADAM28, ZNF800, CYBA, GDI2, STK17B, EIF3I, C9orf78, C4orf3, C19orf53, C11orf58, C12orf57, C6orf62, C1orf21, C1orf56, C1orf43, C7orf50, C10orf90, C8orf34, .
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • bf16: True
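
For readers who want to reuse these settings, the fragment below shows one way they could be passed to SentenceTransformerTrainingArguments. It is a configuration sketch only, assuming the training dependencies of Sentence Transformers are installed; the output_dir value is a hypothetical path that does not come from this card, and no other argument used for the original run is implied.

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Only the non-default values listed above are set; everything else keeps its default.
args = SentenceTransformerTrainingArguments(
    output_dir="output/mmcontext-pubmedbert",  # hypothetical path, not from the card
    eval_strategy="steps",
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    learning_rate=2e-5,
    num_train_epochs=4,
    warmup_ratio=0.1,
    bf16=True,  # needs hardware with bfloat16 support
)
```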

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}
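
The combination of learning_rate 2e-05, warmup_ratio 0.1 and lr_scheduler_type linear can be worked through numerically. The sketch below assumes a single device (so the effective batch size is 256), uses the training set size of 81,143 reported above, and rounds the warmup length up; that rounding and the helper name linear_lr are assumptions of this example rather than confirmed details of the run.

```python
import math

TRAIN_SAMPLES = 81_143      # training set size reported above
BATCH_SIZE = 256            # per_device_train_batch_size (one device assumed)
EPOCHS = 4                  # num_train_epochs
BASE_LR = 2e-5              # learning_rate
WARMUP_RATIO = 0.1          # warmup_ratio

# dataloader_drop_last is False, so the last partial batch still counts as a step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step):
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)  # 317 1268 127
```

Under these assumptions the learning rate peaks after about 127 steps and then decays linearly until step 1268. The figure of 317 steps per epoch agrees with the training logs, where step 50 corresponds to epoch 0.1577.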

Training Logs

Validation Loss is the loss on the cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation evaluation set. Cosine Accuracy is cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_3_cell_sentence_4_cosine_accuracy. A dash marks steps with no evaluation values.

Epoch   Step  Training Loss  Validation Loss  Cosine Accuracy
0.1577  50    4.9245         -                -
0.3155  100   4.1332         4.1477           0.8061
0.4732  150   3.835          -                -
0.6309  200   3.6548         3.9144           0.8210
0.7886  250   3.64           -                -
0.9464  300   3.5722         3.7670           0.8263
1.1041  350   3.4727         -                -
1.2618  400   3.4587         3.6948           0.8280
1.4196  450   3.4024         -                -
1.5773  500   3.3818         3.6560           0.8324
1.7350  550   3.3937         -                -
1.8927  600   3.3877         3.6470           0.8304
2.0505  650   3.308          -                -
2.2082  700   3.3111         3.6163           0.8335
2.3659  750   3.3042         -                -
2.5237  800   3.2928         3.5791           0.8345
2.6814  850   3.276          -                -
2.8391  900   3.2742         3.5903           0.8350
2.9968  950   3.2729         -                -
3.1546  1000  3.2397         3.5732           0.8360
3.3123  1050  3.2306         -                -
3.4700  1100  3.2237         3.5873           0.8349
3.6278  1150  3.2451         -                -
3.7855  1200  3.2429         3.5632           0.8359
3.9432  1250  3.2153         -                -
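
To read the log at a glance, the short sketch below collects the rows that contain evaluation values and looks up the step with the lowest evaluation loss and the step with the highest cosine accuracy. The numbers are copied from the training log above; the variable names are illustrative.

```python
# (step, evaluation loss, cosine accuracy) for the rows that include evaluation,
# copied from the training log above.
eval_rows = [
    (100, 4.1477, 0.8061),
    (200, 3.9144, 0.8210),
    (300, 3.7670, 0.8263),
    (400, 3.6948, 0.8280),
    (500, 3.6560, 0.8324),
    (600, 3.6470, 0.8304),
    (700, 3.6163, 0.8335),
    (800, 3.5791, 0.8345),
    (900, 3.5903, 0.8350),
    (1000, 3.5732, 0.8360),
    (1100, 3.5873, 0.8349),
    (1200, 3.5632, 0.8359),
]

lowest_loss = min(eval_rows, key=lambda row: row[1])
highest_accuracy = max(eval_rows, key=lambda row: row[2])
print(lowest_loss)       # (1200, 3.5632, 0.8359)
print(highest_accuracy)  # (1000, 3.5732, 0.836)
```

Both metrics flatten out after roughly step 1000. The cosine_accuracy of 0.8359 reported in the Evaluation section matches the last logged evaluation at step 1200.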

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 5.0.0
  • Transformers: 4.55.0.dev0
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.9.0
  • Datasets: 2.19.1
  • Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
