language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
AccPhysBERT
AccPhysBERT is a specialized sentence-embedding model fine-tuned for accelerator physics, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.
Model Description
- Architecture: BERT-based, fine-tuned from PhysBERT (cased) using Supervised Contrastive Learning (SimCSE).
- Optimized For: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- Notable Features:
  - Trained on 109 k accelerator-physics publications from INSPIRE HEP
  - Leverages 690 k citation pairs and 2 M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings
Developed by: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro
Funded by: US Department of Energy, Lawrence Berkeley National Laboratory
Model Type: Sentence embedding (BERT-based, SimCSE fine-tuned)
Language: English
License: CC BY 4.0
Paper: Domain-specific text embedding model for accelerator physics, Phys. Rev. Accel. Beams 28, 044601 (2025)
https://doi.org/10.1103/PhysRevAccelBeams.28.044601
Training Data
Core Corpus:
- 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
- Over 1 GB of markdown-style full text (extracted via OCR/Nougat)
Annotation Sources:
- 690,000 citation pairs
- 49 semantic categories labeled via ChatGPT-4o
- 2,000,000 synthetic query–source pairs generated with LLaMA3-70B
Training Procedure
- Fine-tuning Method: SimCSE (contrastive loss)
- Hyperparameters:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- Infrastructure: 32 × NVIDIA A100 GPUs @ NERSC
- Framework: SentenceTransformers
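The contrastive objective listed above can be illustrated with a small, framework-free sketch. It assumes the common SimCSE-style setup in which each anchor's paired text (for example, a cited paper) is the positive and the other positives in the batch act as negatives. The function names and toy vectors are hypothetical, and this is not the authors' training code; only the temperature of 0.05 is taken from the hyperparameters above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE-style loss: anchors[i] should match positives[i];
    every other positives[j] in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Temperature-scaled similarities of this anchor to all candidates
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the candidates
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching pair as the correct class
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical values, for illustration only)
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

A low temperature such as 0.05 sharpens the softmax, so even small differences in cosine similarity between the true pair and the in-batch negatives translate into large differences in loss.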
Evaluation Results
Task | Metric | Score (%) |
---|---|---|
Citation Classification | Cosine accuracy | 91.0 |
Category Clustering | V‑measure (main / sub) | 63.7 / 77.2 |
Information Retrieval | nDCG@10 | 66.3 |
AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all accelerator-specific benchmarks.
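For readers unfamiliar with the retrieval metric in the table, the following is a minimal sketch of nDCG@10 for a single query. It uses the standard log2 discount with linear gain. The exact variant and relevance labels used in the benchmark are not stated here, so the function name and the toy rankings are assumptions for illustration.

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query.

    `relevances` lists the relevance label of each retrieved document
    in ranked order (index 0 = top result). The ideal ranking is built
    from these same labels, which assumes every relevant document
    appears somewhere in the retrieved list.
    """
    def dcg(labels):
        # Gain discounted by log2(position + 1), positions starting at 1
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical rankings: 1 = relevant source, 0 = not relevant
perfect = ndcg_at_k([1, 1, 0, 0, 0])
imperfect = ndcg_at_k([0, 1, 0, 1, 0])
print(round(perfect, 3), round(imperfect, 3))
```

A benchmark score is then typically the mean of this value over all queries.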
Example Usage
from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")
model.eval()
text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
# Mean pooling, excluding [CLS] and [SEP]; this slice is only valid for a single, unpadded sentence
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
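Once sentence embeddings are available, semantic search reduces to ranking candidates by cosine similarity to the query embedding. The sketch below shows only that ranking step. The short vectors and the paper titles are made up and merely stand in for real AccPhysBERT embeddings, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_by_similarity(query_vec, corpus):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 3-D vectors standing in for real sentence embeddings
corpus = {
    "Microbunching instability in a photoinjector": [0.9, 0.1, 0.2],
    "Cryomodule cooldown procedure": [0.1, 0.9, 0.3],
    "Beam halo formation in linacs": [0.7, 0.2, 0.6],
}
query = [0.8, 0.0, 0.3]  # stand-in for the embedding of a beam-instability query

ranking = rank_by_similarity(query, corpus)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```

For a large corpus, the document embeddings would normally be computed once and stored, so that only the query needs to be embedded at search time.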
Citation
If you use AccPhysBERT, please cite:
@article{Hellert_2025,
title = {Domain-specific text embedding model for accelerator physics},
author = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
journal = {Physical Review Accelerators and Beams},
volume = {28},
number = {4},
pages = {044601},
year = {2025},
publisher = {American Physical Society},
doi = {10.1103/PhysRevAccelBeams.28.044601},
url = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
Contact
Thorsten Hellert
Lawrence Berkeley National Laboratory
📧 [email protected]
Acknowledgments
This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.