# 🛠️ alvperez/skill-sim-model
`skill-sim-model` is a fine-tuned Sentence-Transformers checkpoint that maps short skill phrases (e.g. *Python*, *Forklift operation*, *Electrical wiring*) into a 768-D vector space where semantically related skills cluster together.
Training pairs come from the public ESCO taxonomy plus curated hard negatives for job‑matching research.
| Use-case | How to leverage the embeddings |
|---|---|
| Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` |
| Deduplicating skill taxonomies | cluster the vectors |
| Recruiter query-expansion | nearest-neighbour search |
| Exploratory dashboards | feed to t-SNE / PCA |
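The query-expansion row above amounts to a plain nearest-neighbour search over precomputed skill embeddings. A minimal sketch, using toy 3-D vectors as stand-ins for `model.encode(...)` output (and `expand_query` as a hypothetical helper, not part of this repo):

```python
import numpy as np

def expand_query(query_vec, skill_vecs, skill_names, k=2):
    """Return the k skills whose embeddings are closest (cosine) to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    s = skill_vecs / np.linalg.norm(skill_vecs, axis=1, keepdims=True)
    sims = s @ q                 # cosine similarity of each skill to the query
    top = np.argsort(-sims)[:k]  # indices of the k nearest neighbours
    return [skill_names[i] for i in top]

# Toy 3-D stand-ins; in practice use model.encode(names)
names = ["Electrical wiring", "Circuit troubleshooting", "Machine learning"]
vecs = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.2, 0.1],
                 [0.0, 0.1, 0.9]])
print(expand_query(np.array([0.85, 0.15, 0.05]), vecs, names))
# → ['Electrical wiring', 'Circuit troubleshooting']
```

With real 768-D embeddings the same `argsort` approach works, or an ANN index (e.g. FAISS) at scale.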
## 🚀 Quick start

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

skills = ["Electrical wiring",
          "Circuit troubleshooting",
          "Machine learning"]

emb = model.encode(skills, convert_to_tensor=True)
print(util.pytorch_cos_sim(emb[0], emb))  # similarity of the first skill to all three
```
Note that 🤗 Transformers has no `"sentence-similarity"` pipeline task (that tag only drives the Hub widget), so compare a query against candidates with Sentence-Transformers directly:

```python
query_emb = model.encode("forklift operation", convert_to_tensor=True)
cand_emb = model.encode(["pallet jack", "python"], convert_to_tensor=True)
print(util.cos_sim(query_emb, cand_emb))
```
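Candidate ↔ vacancy matching then reduces to aggregating per-skill cosine similarities into one score. A sketch under stated assumptions: the vectors are toy 2-D stand-ins for `model.encode(...)` output, and `match_score` (mean over job skills of the best-matching candidate skill) is one illustrative aggregation, not the repo's method:

```python
import numpy as np

def match_score(cand_vecs, job_vecs):
    """For each job skill, take its best cosine match among the candidate's
    skills, then average those maxima into a single score."""
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    j = job_vecs / np.linalg.norm(job_vecs, axis=1, keepdims=True)
    sims = j @ c.T  # (n_job_skills, n_cand_skills) cosine matrix
    return float(sims.max(axis=1).mean())

# Toy 2-D stand-ins; in practice encode each skill list with the model
candidate = np.array([[1.0, 0.0],   # e.g. "Electrical wiring"
                      [0.6, 0.8]])  # e.g. "Machine learning"
vacancy = np.array([[0.9, 0.1]])    # e.g. "Circuit troubleshooting"
print(round(match_score(candidate, vacancy), 3))
# → 0.994
```

Max-then-mean rewards a candidate who covers each required skill with at least one close match, without penalising extra unrelated skills.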
## 📊 Benchmark

| Metric | Value |
|---|---|
| Spearman correlation | 0.845 |
| ROC AUC | 0.988 |
| MAP@all (cold-start) | 0.232 |

*cold-start* = the system sees only skill strings, no historical interactions.
## ⚙️ Training recipe (brief)

- Base: `sentence-transformers/all-mpnet-base-v2`
- Loss: `CosineSimilarityLoss`
- Epochs × batch: 5 × 32
- LR / warm-up: 2e-5 / 100 steps
- Negatives: random + "hard" pairs from ESCO siblings
- Hardware: 1 × A100 40 GB (≈ 45 min)

Full code in `/training_scripts`.
## 🏹 Intended use
- Employment tech – rank CVs vs. vacancies
- EdTech / reskilling – detect skill gaps, suggest learning paths
- HR analytics – normalise noisy skill fields at scale
## ✋ Limitations & bias
- Vocabulary dominated by ESCO (English); niche jargon may project poorly.
- No explicit fairness constraints were applied during training; downstream systems should audit for fairness (e.g. with a Disparate Impact analysis).
- In our tests, a threshold of 0.65 marks a “definitely related” cut‑off; tune for your own precision‑recall needs.
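Applying the 0.65 cut-off mentioned above is a one-line filter on cosine scores. A minimal sketch; `related_pairs` is an illustrative helper and the scores are made up for the example:

```python
RELATED_THRESHOLD = 0.65  # from our tests; tune on your own precision-recall curve

def related_pairs(scored_pairs, threshold=RELATED_THRESHOLD):
    """Keep only skill pairs whose cosine similarity clears the threshold."""
    return [(a, b) for (a, b, s) in scored_pairs if s >= threshold]

pairs = [("Electrical wiring", "Circuit troubleshooting", 0.81),
         ("Electrical wiring", "Machine learning", 0.12)]
print(related_pairs(pairs))
# → [('Electrical wiring', 'Circuit troubleshooting')]
```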
## 🔍 Citation

```bibtex
@misc{alvperez2025skillsim,
  title        = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
  author       = {Pérez Amado, Álvaro},
  howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
  year         = {2025}
}
```
## Acknowledgements
Built on top of Sentence-Transformers and the public ESCO dataset.
Feedback & PRs welcome!