Menon nb-bert relevance scorer

Binary classifier built on top of NbAiLab/nb-bert-base. Used by Menon Economics to score procurement notices as RELEVANT or NOT_RELEVANT for Menon's lead pipeline.

How it works

The model takes the Norwegian-language project description (kort_beskrivelse) and returns a relevance score in [0, 1]. A tuned threshold (saved in threshold.json) converts the score into a binary label.
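As a rough illustration of that last step, the sketch below loads a threshold from a JSON file and turns a score into a label. The JSON key name `threshold`, the helper names, and the `>=` comparison are assumptions made for this example; the actual layout of threshold.json and the logic in score.py may differ.

```python
import json
from pathlib import Path


def load_threshold(path: str, key: str = "threshold") -> float:
    """Read the decision threshold from a JSON file.

    The key name is an assumption; check threshold.json for the real one.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return float(data[key])


def to_label(score: float, threshold: float) -> str:
    """Map a relevance score in [0, 1] to a binary label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # Assumed convention: scores at or above the threshold count as relevant.
    return "RELEVANT" if score >= threshold else "NOT_RELEVANT"
```

With the published threshold, `to_label(0.83, 0.2594)` yields RELEVANT and `to_label(0.10, 0.2594)` yields NOT_RELEVANT.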

The training pipeline used:

description-only input (no tittel, no oppdragsgiver, no portal/country features) to avoid client- or country-identity shortcuts;
per-row weighting that downweights near-duplicate templated negatives and upweights international positives;
a stratified train/validation/test split with a held-out test set the model never saw during threshold tuning.
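The per-row weighting can be pictured with a small sketch. The exact scheme used in training is not documented here, so everything below is an assumption: negatives in a near-duplicate cluster of size k each get weight 1/k, and international positives are multiplied by a hypothetical boost factor. The `Row` type, `cluster_id` field, and `intl_boost` value are illustrative names only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    is_relevant: bool
    international: bool
    cluster_id: Optional[int] = None  # hypothetical near-duplicate cluster id


def row_weights(rows: List[Row], intl_boost: float = 3.0) -> List[float]:
    """Return one training weight per row (assumed scheme, see lead-in)."""
    # Count how many negatives share each near-duplicate cluster.
    cluster_sizes = Counter(
        r.cluster_id for r in rows if not r.is_relevant and r.cluster_id is not None
    )
    weights = []
    for r in rows:
        w = 1.0
        if not r.is_relevant and r.cluster_id is not None:
            # A cluster of k templated negatives counts as roughly one example.
            w /= cluster_sizes[r.cluster_id]
        if r.is_relevant and r.international:
            # Rare international positives get extra weight.
            w *= intl_boost
        weights.append(w)
    return weights
```

Weights like these would typically be applied by multiplying each example's loss by its weight during fine-tuning.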

Empty / placeholder / non-Norwegian inputs are routed to needs_review rather than being scored, so the model only commits to a label on inputs it can reasonably judge.
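A minimal sketch of such a gate is shown below. It is not the code in inference_rules.py: the function name `review_reason`, the length cutoff, the placeholder list, the "placeholder" reason string, and the order of the checks are all assumptions. The language detector is passed in as a callable so the sketch runs without extra dependencies; in practice it could be `langdetect.detect`, which returns codes such as "no" or "en".

```python
import re
from typing import Callable, Optional

MIN_CHARS = 30  # hypothetical cutoff; the real value lives in inference_rules.py
PLACEHOLDERS = {"n/a", "tbd", "-"}  # hypothetical placeholder strings
ACCEPTED_LANGS = {"no", "da", "sv"}  # Norwegian, plus Danish and Swedish


def review_reason(text: Optional[str], detect_lang: Callable[[str], str]) -> Optional[str]:
    """Return a reason string if the input should skip scoring, else None."""
    # Collapse whitespace so padding does not count toward the length.
    cleaned = re.sub(r"\s+", " ", text or "").strip()
    if not cleaned:
        return "empty"
    if cleaned.lower() in PLACEHOLDERS:
        return "placeholder"
    if len(cleaned) < MIN_CHARS:
        return f"too_short(len={len(cleaned)})"
    lang = detect_lang(cleaned)
    if lang not in ACCEPTED_LANGS:
        return f"non_norwegian({lang})"
    return None
```

A caller would score the text only when `review_reason` returns None and otherwise emit a needs_review result carrying the reason.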

Held-out test results (n = 1,214)

subset	precision	recall	F1
overall	0.76	0.89	0.82
international subset (n=8)	0.86	1.00	0.92

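The decision threshold of 0.2594 (saved in threshold.json) was tuned on the validation split to reach recall ≥ 0.90. On the held-out test set the same threshold gives recall 0.89, just under that target.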
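One way a recall-targeted threshold could be chosen is sketched below: try the observed validation scores from highest to lowest and keep the first one whose recall meets the target. This selection rule and the function names are assumptions for illustration; the actual tuning procedure is not described in this card.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def tune_threshold(scores: Sequence[float], labels: Sequence[int],
                   min_recall: float = 0.90) -> float:
    """Highest observed score that, used as threshold, reaches min_recall."""
    # Recall can only grow as the threshold drops, so the first hit
    # from the top is the strictest threshold that meets the target.
    for t in sorted(set(scores), reverse=True):
        _, recall = precision_recall(scores, labels, t)
        if recall >= min_recall:
            return t
    raise ValueError("no threshold reaches the requested recall")
```

Choosing the highest qualifying threshold keeps precision as high as possible while still meeting the recall target on validation data; recall on new data can still land slightly below the target.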

Usage

from score import score_lead

# Norwegian input — gets a real score
score_lead("Anskaffelse av samfunnsøkonomisk analyse for evaluering...")
# → {"label": "RELEVANT", "score": 0.83, "threshold": 0.2594, "reason": "ok"}

# Empty / placeholder / non-Norwegian input — routed to review, not scored
score_lead("")
# → {"label": "needs_review", "score": None, "reason": "empty"}

score_lead("Se konkurransegrunnlag")
# → {"label": "needs_review", "score": None, "reason": "too_short(len=22)"}

score_lead("TRANSQ is a joint qualification system for transport suppliers.")
# → {"label": "needs_review", "score": None, "reason": "non_norwegian(en)"}
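For batches, the result dictionaries shown above can be grouped by label. The helper below is a hypothetical convenience wrapper, not part of score.py; it takes the scoring function as an argument (for example `score_lead`) and relies only on the `label` key seen in the examples.

```python
from typing import Callable, Dict, Iterable, List


def triage(descriptions: Iterable[str],
           scorer: Callable[[str], dict]) -> Dict[str, List[dict]]:
    """Group scoring results by label, keeping the input text with each result."""
    buckets: Dict[str, List[dict]] = {
        "RELEVANT": [],
        "NOT_RELEVANT": [],
        "needs_review": [],
    }
    for text in descriptions:
        result = scorer(text)
        # setdefault keeps the helper working if a new label ever appears.
        buckets.setdefault(result["label"], []).append({"text": text, **result})
    return buckets
```

Calling `triage(texts, score_lead)` would return three lists, with the needs_review bucket ready to hand to a human reviewer.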

Important: input must be in Norwegian

The model assumes incoming descriptions are already in Norwegian Bokmål. The lead-scraper translates non-Norwegian leads upstream, so by the time a lead reaches this model in production it is in Norwegian.

If a description in another language slips through, it is intentionally flagged needs_review so that a human can obtain a correct translation, rather than having the model return a low-confidence guess. For one-off, ad-hoc scoring of raw foreign-language text, translate it first with any tool (for example DeepL, Google Translate, or an OpenAI GPT model) and then call score_lead.

Requires:

transformers, torch, langdetect
No API keys needed.

Files in this repo

file	purpose
`model.safetensors`, `config.json`	Model weights + config
`tokenizer.json`, `vocab.txt`, etc.	Tokenizer
`threshold.json`	Tuned decision threshold
`inference_rules.py`	`needs_review()` gate (empty / short / placeholder / non-Norwegian)
`score.py`	End-to-end scoring function (use this)

Training data

Roughly 13,000 labeled procurement leads from Doffin, Mercell, TED, Nordisk ministerråd, Hilma, and FHF. Per-row weights encode class balance, cluster-based deduplication of near-duplicate negatives, and an upweight on international positives. After removing inputs that the needs_review gate would catch, about 12,100 rows remained; these were used for training and evaluation.

The dataset was split 80 / 10 / 10 (train / validation / test), stratified by (Is_relevant, international) so the rare international examples are represented in every split.
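A stratified 80 / 10 / 10 split of this kind can be sketched in plain Python as follows. The function name, the rounding rule, the seed, and the handling of tiny strata are assumptions; the original pipeline may well have used a library routine instead.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple


def stratified_split(rows: Sequence[dict],
                     key: Callable[[dict], Hashable],
                     fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                     seed: int = 0) -> Tuple[List[dict], List[dict], List[dict]]:
    """Split rows into train/validation/test, one stratum at a time."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n = len(group)
        n_val = round(n * fractions[1])
        n_test = round(n * fractions[2])
        # Tiny strata: give validation and test at least one row each.
        if n >= 3:
            n_val, n_test = max(n_val, 1), max(n_test, 1)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test
```

For the stratification described above, the key would be something like `lambda r: (r["Is_relevant"], r["international"])`, so each combination of the two flags is split separately.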

Caveats

The international evaluation subset is small (~8 held-out positives). The 100% recall on that subset is encouraging but high-variance.
The needs_review gate is deliberately lenient toward Danish and Swedish: both are largely mutually intelligible with Norwegian Bokmål and the underlying model handles them well, so they pass through to scoring.
Production assumption: leads arrive translated. Historically, about 5–7 non-Norwegian leads per month have slipped past the scraper; under this model they are routed to human review.
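To put a number on how uncertain the recall on the small international subset is, a Wilson score interval can be computed. The sketch below assumes 8 held-out positives, all recovered, and a 95% level (z = 1.96); the function name is illustrative, and with these inputs the interval comes out at roughly 0.68 to 1.00.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(8, 8)
print(f"recall 8/8: 95% interval about {low:.2f} to {high:.2f}")
```

In other words, a perfect score on 8 examples is still compatible with a true recall well below 0.90.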

Model details

Repository: RozaA/Menon-nb-bert-base-v2
Base model: NbAiLab/nb-bert-base
Weights: Safetensors, 0.2B parameters, F32