🛠️ alvperez/skill-sim-model

skill-sim-model is a fine-tuned Sentence-Transformers checkpoint that maps short skill phrases (e.g. Python, Forklift operation, Electrical wiring) into a 768‑D vector space where semantically related skills cluster together.
Training pairs come from the public ESCO taxonomy plus curated hard negatives for job‑matching research.

| Use case | How to leverage the embeddings |
|---|---|
| Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` |
| Deduplicating skill taxonomies | cluster the vectors |
| Recruiter query expansion | nearest-neighbour search |
| Exploratory dashboards | feed to t-SNE / PCA |

🚀 Quick start

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

skills = ["Electrical wiring",
          "Circuit troubleshooting",
          "Machine learning"]

emb = model.encode(skills, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb))   # 1 × 3 similarity matrix

# One-off query vs. candidates. Note: transformers' pipeline() has no
# "sentence-similarity" task, so we stay inside sentence-transformers:
query = model.encode("forklift operation", convert_to_tensor=True)
candidates = model.encode(["pallet jack", "python"], convert_to_tensor=True)
print(util.cos_sim(query, candidates))
```

📊 Benchmark

| Metric | Value |
|---|---|
| Spearman correlation | 0.845 |
| ROC AUC | 0.988 |
| MAP@all (cold start) | 0.232 |

cold‑start = the system sees only skill strings, no historical interactions.


⚙️ Training recipe (brief)

  • Base: sentence-transformers/all-mpnet-base-v2
  • Loss: CosineSimilarityLoss
  • Epochs × batch: 5 × 32
  • LR / warm‑up: 2e-5 / 100 steps
  • Negatives: random + “hard” pairs from ESCO siblings
  • Hardware: 1 × A100 40 GB (≈ 45 min)

Full code in /training_scripts.


🏹 Intended use

  • Employment tech – rank CVs vs. vacancies
  • EdTech / reskilling – detect skill gaps, suggest learning paths
  • HR analytics – normalise noisy skill fields at scale

✋ Limitations & bias

  • Vocabulary dominated by ESCO (English); niche jargon may project poorly.
  • No explicit fairness constraints were applied during training; downstream systems should run their own audits (e.g. disparate-impact analysis).
  • In our tests, a threshold of 0.65 marks a “definitely related” cut‑off; tune for your own precision‑recall needs.

🔍 Citation

@misc{alvperez2025skillsim,
  title  = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
  author = {Pérez Amado, Álvaro},
  howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
  year   = {2025}
}

Acknowledgements

Built on top of Sentence-Transformers and the public ESCO dataset.
Feedback & PRs welcome!
