GitHub Issues MPNet Sentence Transformer (10 Epochs)

This is a sentence-transformers model specialized for GitHub issue data.

Dataset

For training, we used the NLBSE22 dataset after removing duplicates and issues with an empty body. The sentence embedding model was trained on the similarity between the title and the body of each issue.
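
The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `title` and `body` field names, the `build_pairs` helper, and the choice to treat only exact (title, body) repeats as duplicates are all assumptions.

```python
def build_pairs(issues):
    """Return (title, body) pairs, skipping empty bodies and exact duplicates."""
    seen = set()
    pairs = []
    for issue in issues:
        # Hypothetical field names; the real dataset schema may differ.
        title = (issue.get("title") or "").strip()
        body = (issue.get("body") or "").strip()
        if not body:
            continue  # drop issues with an empty body
        key = (title, body)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        pairs.append(key)
    return pairs


issues = [
    {"title": "Crash on startup", "body": "App crashes when config is missing."},
    {"title": "Crash on startup", "body": "App crashes when config is missing."},
    {"title": "Add dark mode", "body": ""},
    {"title": "Typo in README", "body": "The word 'install' is misspelled."},
]
pairs = build_pairs(issues)
print(len(pairs))  # 2
```

Each surviving pair can then serve as one (anchor, positive) training example.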

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

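from sentence_transformers import SentenceTransformer

# Example inputs; replace them with issue titles or bodies.
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and encode each sentence into a 768-dimensional vector.
model = SentenceTransformer('Collab-uniba/github-issues-mpnet-st-e10')
embeddings = model.encode(sentences)
print(embeddings)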
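
Once two texts are encoded, they are typically compared with cosine similarity, the same similarity function used during training. The sketch below uses toy 3-dimensional vectors in place of real model output so that it runs without downloading the model; the `cosine_similarity` helper and the vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the 768-dimensional embeddings.
title_vec = [1.0, 2.0, 0.0]
body_vec = [2.0, 4.0, 0.0]   # same direction as title_vec
other_vec = [0.0, 0.0, 3.0]  # orthogonal to title_vec

print(cosine_similarity(title_vec, body_vec))   # close to 1
print(cosine_similarity(title_vec, other_vec))  # 0.0
```

With real embeddings, the rows returned by `model.encode` can be passed to this function directly, or `sentence_transformers.util.cos_sim` can be used instead.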

Training

The model was trained for ten epochs using Multiple Negatives Ranking Loss, under the assumption that the title and body of the same issue should be similar. We used the following parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 39221 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
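
To make the role of `scale` and `cos_sim` concrete, here is a simplified pure-Python sketch of the idea behind this loss: for each title in a batch, its own body is the positive and the other bodies in the batch act as negatives, and the scaled cosine similarities are fed to a cross-entropy. It is an illustration under those assumptions, not the library implementation; `mnr_loss` and the toy vectors are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" for a batch of two issues.
titles = [[1.0, 0.0], [0.0, 1.0]]
bodies = [[1.0, 0.0], [0.0, 1.0]]

aligned = mnr_loss(titles, bodies)         # each title matches its own body
shuffled = mnr_loss(titles, bodies[::-1])  # each title paired with the wrong body
print(aligned, shuffled)
```

The loss is near zero when every title is closest to its own body and grows when the pairing is wrong, which is the behavior that training minimizes.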

Parameters of the fit()-Method:

{
    "epochs": 10,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 39221,
    "weight_decay": 0.01
}
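
The interaction between `warmup_steps`, `epochs`, and the `WarmupLinear` scheduler can be worked out with a short sketch. It assumes that, with `steps_per_epoch` set to null, one epoch equals the DataLoader length (39221 steps), and that the schedule rises linearly to the base learning rate and then decays linearly to zero; `warmup_linear_factor` is a hypothetical helper written for illustration.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp-up
    # Linear decay from 1.0 at the end of warmup to 0.0 at the last step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


STEPS_PER_EPOCH = 39221  # assumed equal to the DataLoader length
EPOCHS = 10
WARMUP_STEPS = 39221
BASE_LR = 2e-05
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS  # 392210

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS):
    print(step, BASE_LR * warmup_linear_factor(step, WARMUP_STEPS, TOTAL_STEPS))
```

Under these assumptions, warmup covers exactly the first of the ten epochs, so the learning rate peaks at 2e-05 after one pass over the data and decays over the remaining nine.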

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
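
The pooling layer has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings. The sketch below illustrates masked mean pooling in plain Python, assuming padding tokens are excluded through an attention mask; `mean_pool` and the toy 2-dimensional vectors are illustrative and stand in for the 768-dimensional MPNet outputs.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                sums[k] += vec[k]
    count = max(count, 1)  # avoid division by zero on an all-padding input
    return [s / count for s in sums]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and is ignored
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```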

Citing & Authors

@article{Colavito_2025_Benchmarking,
    title        = {Benchmarking large language models for automated labeling: The case of issue report classification},
    author       = {Giuseppe Colavito and Filippo Lanubile and Nicole Novielli},
    year         = 2025,
    journal      = {Information and Software Technology},
    volume       = 184,
    pages        = 107758,
    doi          = {https://doi.org/10.1016/j.infsof.2025.107758},
    issn         = {0950-5849},
    url          = {https://www.sciencedirect.com/science/article/pii/S0950584925000977},
    keywords     = {Issue labeling, Generative AI, Software maintenance and evolution}
}
Model size: 109M params (Safetensors; tensor types I64 and F32)