---
language:
  - pl
metrics:
  - f1
base_model:
  - allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
  - safe
  - safety
  - ai-safety
  - llm
  - moderation
  - classification
license: cc-by-nc-sa-4.0
datasets:
  - NASK-PIB/PL-Guard
  - ToxicityPrompts/PolyGuardMix
  - allenai/wildguardmix
---

# HerBERT-Guard for Polish: LLM Safety Classifier

## Model Overview

HerBERT-Guard is a Polish-language safety classifier built on HerBERT, a BERT-based model pretrained on large-scale Polish corpora. It was fine-tuned to detect safety-relevant content in Polish text using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the PolyGuard and WildGuard datasets. The model classifies input into a taxonomy of safety categories inspired by Llama Guard.

More detailed information is available in the publication cited below.

## Usage

You can use the model with the standard Hugging Face `transformers` text-classification pipeline:

```python
from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
```
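The pipeline returns a list of dictionaries with `label` and `score` keys. A minimal helper for flagging unsafe inputs could look like the sketch below; note that the exact label strings (`safe` plus category codes such as `S9`) are an assumption here, so verify them against the `id2label` mapping in the model's `config.json`:

```python
def flag_unsafe(results, threshold=0.5):
    """Pick the top-scoring prediction and flag it if it is an unsafe category.

    `results` is the list of {"label": ..., "score": ...} dicts returned by a
    Hugging Face text-classification pipeline. The label strings used here
    ("safe" vs. category codes such as "S9") are an assumption; confirm them
    against the model's id2label mapping.
    """
    top = max(results, key=lambda r: r["score"])
    is_unsafe = top["label"] != "safe" and top["score"] >= threshold
    return is_unsafe, top["label"]

# Mocked pipeline output for illustration:
mock_result = [{"label": "S9", "score": 0.97}]
print(flag_unsafe(mock_result))  # (True, 'S9')
```

The `threshold` parameter lets you trade recall for precision: lowering it flags more borderline inputs as unsafe.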

## Safety Categories

The model outputs one of 15 labels:

- `safe` — the content is not considered safety-relevant, or
- one of the following 14 unsafe categories, based on the Llama Guard taxonomy:
  1. S1: Violent Crimes
  2. S2: Non-Violent Crimes
  3. S3: Sex-Related Crimes
  4. S4: Child Sexual Exploitation
  5. S5: Defamation
  6. S6: Specialized Advice
  7. S7: Privacy
  8. S8: Intellectual Property
  9. S9: Indiscriminate Weapons
  10. S10: Hate
  11. S11: Suicide & Self-Harm
  12. S12: Sexual Content
  13. S13: Elections
  14. S14: Code Interpreter Abuse
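When post-processing predictions, the category codes can be mapped to readable names with a small lookup table (a convenience sketch; the mapping simply mirrors the list above):

```python
# Code-to-name mapping for the 14 unsafe categories listed above.
SAFETY_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def describe(label):
    """Translate a predicted label into a human-readable category name."""
    if label == "safe":
        return "safe"
    return SAFETY_CATEGORIES.get(label, "unknown category")

print(describe("S11"))  # Suicide & Self-Harm
```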

## License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:

- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained allegro/herbert-base-cased, which is distributed under the CC BY 4.0 license.

Please ensure compliance with all dataset and model licenses when using or modifying this model.

## 📚 Citation

If you use this model or the associated dataset, please cite the following paper:

```bibtex
@inproceedings{plguard2025,
  author    = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
  title     = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
  booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics}
}
```