---
language: en
license: cc-by-4.0
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - bert
  - accelerator-physics
  - physics
  - scientific-literature
  - embeddings
  - domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
  - name: AccPhysBERT
    results:
      - task:
          type: feature-extraction
          name: Feature Extraction
        dataset:
          name: Accelerator Physics Publications
          type: accelerator-physics
        metrics:
          - type: cosine_accuracy
            value: 0.91
            name: Citation Classification
          - type: v_measure
            value: 0.637
            name: Category Clustering (main)
          - type: ndcg_at_10
            value: 0.663
            name: Information Retrieval
datasets:
  - inspire-hep
---

AccPhysBERT

AccPhysBERT is a specialized sentence-embedding model fine-tuned for accelerator physics, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.


Model Description

  • Architecture: BERT-based, fine-tuned from PhysBERT (cased) using supervised SimCSE contrastive learning.
  • Optimized For: Titles, abstracts, proposals, and full text from the accelerator-physics community.
  • Notable Features:
    • Trained on 109 k accelerator-physics publications from INSPIRE HEP
    • Leverages 690 k citation pairs and 2 M synthetic query–source pairs
    • Trained via SentenceTransformers to produce dense, semantically rich embeddings

Developed by: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro
Funded by: US Department of Energy, Lawrence Berkeley National Laboratory
Model Type: Sentence embedding (BERT-based, SimCSE fine-tuned)
Language: English
License: CC BY 4.0
Paper: Domain-specific text embedding model for accelerator physics, Phys. Rev. Accel. Beams 28, 044601 (2025)
https://doi.org/10.1103/PhysRevAccelBeams.28.044601


Training Data

  • Core Corpus:
    • 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
    • Over 1 GB of full text in markdown-style format, extracted from PDFs with the Nougat OCR model
  • Annotation Sources:
    • 690,000 citation pairs
    • 49 semantic categories labeled with GPT-4o
    • 2,000,000 synthetic query–source pairs generated with Llama 3 70B

Training Procedure

  • Fine-tuning Method: SimCSE (contrastive loss)
  • Hyperparameters:
    • Batch size: 512
    • Learning rate: 2e-4
    • Temperature: 0.05
    • Weight decay: 0.01
    • Optimizer: Adam
    • Epochs: 2
  • Infrastructure: 32 × NVIDIA A100 GPUs @ NERSC
  • Framework: SentenceTransformers
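
For orientation, below is a minimal sketch of such a contrastive fine-tuning run in SentenceTransformers. MultipleNegativesRankingLoss (in-batch negatives with scaled cosine similarity) stands in here for the SimCSE objective, with scale = 1/temperature; the two training pairs are illustrative placeholders, not the actual citation or query–source data, and the tiny batch size is for demonstration only.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the PhysBERT base model, as described above
model = SentenceTransformer("thellert/physbert_cased")

# Illustrative placeholder pairs; the real data are the citation and
# synthetic query-source pairs listed under Training Data
train_examples = [
    InputExample(texts=[
        "Emittance growth in the LCLS-II injector",
        "We observe transverse emittance dilution downstream of the gun.",
    ]),
    InputExample(texts=[
        "RF cavity detuning compensation",
        "Piezo tuners stabilize superconducting cavities against microphonics.",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives with scaled cosine similarity; scale = 1/temperature,
# so temperature 0.05 corresponds to scale 20
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_class=torch.optim.Adam,
    optimizer_params={"lr": 2e-4},
    weight_decay=0.01,
)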

Evaluation Results

Task                      Metric                  Score
Citation Classification   Cosine Accuracy         91.0%
Category Clustering       V-measure (main/sub)    63.7 / 77.2
Information Retrieval     nDCG@10                 66.3

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models across all of these accelerator-specific benchmarks.
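
The citation-classification metric reflects how well cosine similarity between embeddings separates related from unrelated papers. Below is a short sketch of that style of scoring, with illustrative sentences (assuming the hub id thellert/accphysbert_cased):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert_cased")

# Two related accelerator-physics statements and one unrelated sentence
sentences = [
    "Emittance growth from coherent synchrotron radiation in bunch compressors.",
    "CSR-induced emittance dilution in magnetic chicanes.",
    "The cafeteria menu changes every Tuesday.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise cosine similarities; the related pair should score far higher
scores = util.cos_sim(embeddings, embeddings)
print(f"related:   {scores[0, 1].item():.3f}")
print(f"unrelated: {scores[0, 2].item():.3f}")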


Example Usage

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert_cased")
model = AutoModel.from_pretrained("thellert/accphysbert_cased")
model.eval()

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, excluding [CLS] and [SEP]
# (valid for a single, unpadded sequence as encoded here)
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
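
Since the card lists sentence-transformers as its library, the same embedding can be produced with less boilerplate through that API, which applies the pooling configuration stored with the model; a minimal sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thellert/accphysbert_cased")

sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Transverse wakefield effects in a superconducting linac.",
]

# Returns a (len(sentences), hidden_size) array of sentence embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)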

Citation

If you use AccPhysBERT, please cite:

@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}

Contact

Thorsten Hellert
Lawrence Berkeley National Laboratory
📧 [email protected]


Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.