---
language:
  - pl
metrics:
  - f1
base_model:
  - allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
  - safe
  - safety
  - ai-safety
  - llm
  - moderation
  - classification
license: cc-by-nc-sa-4.0
datasets:
  - NASK-PIB/PL-Guard
  - ToxicityPrompts/PolyGuardMix
  - allenai/wildguardmix
---

# HerBERT-Guard for Polish: LLM Safety Classifier

## Model Overview

HerBERT-Guard is a Polish-language safety classifier built on HerBERT, a BERT-based model pretrained on large-scale Polish corpora. It was fine-tuned to detect safety-relevant content in Polish text using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the PolyGuard and WildGuard datasets. The model classifies input into a taxonomy of safety categories inspired by Llama Guard.

More detailed information is available in the publication cited below.

## Usage

You can use the model with the standard Hugging Face `transformers` text-classification pipeline:

```python
from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
```
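The pipeline returns a list of dictionaries with `label` and `score` keys. A minimal helper for flagging unsafe inputs could look like the sketch below; note that the exact label strings (`safe` plus category codes such as `S9`) are an assumption here, so verify them against the `id2label` mapping in the model's `config.json`:

```python
def flag_unsafe(results, threshold=0.5):
    """Pick the top-scoring prediction and flag it if it is an unsafe category.

    `results` is the list of {"label": ..., "score": ...} dicts returned by a
    Hugging Face text-classification pipeline. The label strings used here
    ("safe" vs. category codes such as "S9") are an assumption; confirm them
    against the model's id2label mapping.
    """
    top = max(results, key=lambda r: r["score"])
    is_unsafe = top["label"] != "safe" and top["score"] >= threshold
    return is_unsafe, top["label"]

# Mocked pipeline output for illustration:
mock_result = [{"label": "S9", "score": 0.97}]
print(flag_unsafe(mock_result))  # (True, 'S9')
```

The `threshold` parameter lets you trade recall for precision: lowering it flags more borderline inputs as unsafe.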

## Safety Categories

The model outputs one of 15 labels:

- `safe` — the content is not considered safety-relevant, or
- one of the following 14 unsafe categories, based on the Llama Guard taxonomy:
  1. S1: Violent Crimes
  2. S2: Non-Violent Crimes
  3. S3: Sex-Related Crimes
  4. S4: Child Sexual Exploitation
  5. S5: Defamation
  6. S6: Specialized Advice
  7. S7: Privacy
  8. S8: Intellectual Property
  9. S9: Indiscriminate Weapons
  10. S10: Hate
  11. S11: Suicide & Self-Harm
  12. S12: Sexual Content
  13. S13: Elections
  14. S14: Code Interpreter Abuse
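When post-processing predictions, the category codes can be mapped to readable names with a small lookup table (a convenience sketch; the mapping simply mirrors the list above):

```python
# Code-to-name mapping for the 14 unsafe categories listed above.
SAFETY_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def describe(label):
    """Translate a predicted label into a human-readable category name."""
    if label == "safe":
        return "safe"
    return SAFETY_CATEGORIES.get(label, "unknown category")

print(describe("S11"))  # Suicide & Self-Harm
```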

## License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:

- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained allegro/herbert-base-cased, which is distributed under the CC BY 4.0 license.

Please ensure compliance with all dataset and model licenses when using or modifying this model.

## 📚 Citation

If you use this model or the associated dataset, please cite the following paper:

```bibtex
@inproceedings{plguard2025,
  author    = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
  title     = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
  booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics}
}
```