AccPhysBERT

AccPhysBERT is a specialized sentence-embedding model fine-tuned for accelerator physics, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.


Model Description

  • Architecture: BERT-based, fine-tuned from PhysBERT (cased) using Supervised Contrastive Learning (SimCSE).
  • Optimized For: Titles, abstracts, proposals, and full text from the accelerator-physics community.
  • Notable Features:
    • Trained on 109 k accelerator-physics publications from INSPIRE HEP
    • Leverages 690 k citation pairs and 2 M synthetic query–source pairs
    • Trained via SentenceTransformers to produce dense, semantically rich embeddings
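
Because the model was trained through SentenceTransformers, it can also be loaded directly with that library. A minimal sketch, assuming the published checkpoint is compatible with SentenceTransformer loading (repo id taken from this model page; the example abstracts are made up):

from sentence_transformers import SentenceTransformer

# Load the checkpoint via SentenceTransformers (assumes the repo id below is correct).
model = SentenceTransformer("thellert/accphysbert_cased")

abstracts = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Lattice optimization for a fourth-generation storage-ring light source.",
]

# encode() returns one dense embedding vector per input text.
embeddings = model.encode(abstracts)
print(embeddings.shape)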

Developed by: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro
Funded by: US Department of Energy, Lawrence Berkeley National Laboratory
Model Type: Sentence embedding (BERT-based, SimCSE fine-tuned)
Language: English
License: CC BY 4.0
Paper: Domain-specific text embedding model for accelerator physics, Phys. Rev. Accel. Beams 28, 044601 (2025)
https://doi.org/10.1103/PhysRevAccelBeams.28.044601


Training Data

  • Core Corpus:

    • 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
    • Over 1 GB of full text converted to markdown-style text with OCR (Nougat)
  • Annotation Sources (see the pairing sketch after this list):

    • 690,000 citation pairs
    • 49 semantic categories labeled via ChatGPT-4o
    • 2,000,000 synthetic query–source pairs generated with LLaMA3-70B
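
As an illustration only, the citation and query–source pairs above can be represented as positive text pairs for contrastive training in SentenceTransformers. The texts below are made up; the exact data pipeline is described in the paper.

from sentence_transformers import InputExample

# Hypothetical positive pair: a citing passage and the abstract of the cited work.
# During contrastive training, the remaining pairs in a batch act as in-batch negatives.
citation_pair = InputExample(texts=[
    "The impedance model follows the broadband resonator approach of Ref. [12].",
    "We derive a broadband impedance model for small-gap insertion-device chambers.",
])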

Training Procedure

  • Fine-tuning Method: SimCSE (contrastive loss; see the sketch after this list)
  • Hyperparameters:
    • Batch size: 512
    • Learning rate: 2e-4
    • Temperature: 0.05
    • Weight decay: 0.01
    • Optimizer: Adam
    • Epochs: 2
  • Infrastructure: 32 × NVIDIA A100 GPUs @ NERSC
  • Framework: SentenceTransformers
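
The snippet below is a minimal sketch of how such a SimCSE-style contrastive fine-tuning run could be set up in SentenceTransformers with the hyperparameters listed above. The base-model repo id, the toy training pairs, and the use of MultipleNegativesRankingLoss (whose scale is the inverse temperature, 1/0.05 = 20) are illustrative assumptions, not the authors' exact training script.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the PhysBERT base model (assumed repo id).
model = SentenceTransformer("thellert/physbert_cased")

# Toy positive pairs standing in for citation and query-source pairs;
# the real run used a batch size of 512.
train_examples = [
    InputExample(texts=["query about emittance growth", "source paragraph on emittance growth"]),
    InputExample(texts=["query about RF cavity detuning", "source paragraph on cavity detuning"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch-negatives contrastive loss; scale = 1 / temperature = 1 / 0.05 = 20.
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_class=torch.optim.Adam,
    optimizer_params={"lr": 2e-4},
    weight_decay=0.01,
)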

Evaluation Results

Task                     | Metric                  | Score
Citation Classification  | Cosine Accuracy         | 91.0%
Category Clustering      | V-measure (main / sub)  | 63.7 / 77.2
Information Retrieval    | nDCG@10                 | 66.3

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models in all accelerator-specific benchmarks.
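
For reference, a retrieval-style check with the released model can be run by embedding a query and candidate passages and ranking the passages by cosine similarity. This is a minimal sketch: the query and document texts are illustrative, and the repo id is taken from this model page.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert_cased")

query = "emittance growth from intrabeam scattering in a storage ring"
docs = [
    "Intrabeam scattering drives transverse emittance growth at high bunch charge.",
    "We describe the cryogenic distribution system of the superconducting linac.",
]

# Rank candidate documents by cosine similarity to the query embedding.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]
print(scores.argsort(descending=True), scores)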


Example Usage

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert_cased")
model = AutoModel.from_pretrained("thellert/accphysbert_cased")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")

# Inference only: no gradients needed.
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, excluding [CLS] and [SEP].
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
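
The slice-based pooling above assumes a single, unpadded sentence. For a padded batch, a mask-aware mean pooling is a common alternative; this sketch continues from the tokenizer and model objects defined above and averages over all non-padding tokens.

sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Orbit feedback performance in a fourth-generation storage ring.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Zero out padding positions before averaging over the sequence dimension.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)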

Citation

If you use AccPhysBERT, please cite:

@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}

Contact

Thorsten Hellert
Lawrence Berkeley National Laboratory
📧 [email protected]


Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.
