LLM_Alignment_Evaluation

LLM_Alignment_Evaluation / src /about.py

MCILAB

Update src/about.py

c7418c1 verified 6 months ago

	from dataclasses import dataclass
	from enum import Enum

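	@dataclass
	class Task:
	    benchmark: str  # task key in the results JSON file
	    metric: str  # metric key in the results JSON file
	    col_name: str  # column name displayed in the leaderboard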


	# Select your tasks here
	# ---------------------------------------------------
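	class Tasks(Enum):
	    # Task(task key in the JSON file, metric key in the JSON file, name to display in the leaderboard)
	    # The metric keys ("safty", "socail-norm") are left exactly as written, on the
	    # assumption that they must match the keys in the existing results files.
	    task0 = Task("task_name", "safty", "Safety")
	    task1 = Task("task_name2", "fairness", "Fairness")
	    task2 = Task("task_name3", "socail-norm", "Social Norms")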


	NUM_FEWSHOT = 0  # Number of few-shot examples; change to match your setup
	# ---------------------------------------------------



	# Your leaderboard name
	TITLE = """<h1 align="center" id="space-title">Open Persian LLM Alignment Leaderboard</h1>"""

	# What does your leaderboard evaluate?
	INTRODUCTION_TEXT = """
	Open Persian LLM Alignment Leaderboard
	"""

	# Which evaluations are you running? how can people reproduce what you have?
	LLM_BENCHMARKS_TEXT = f"""
	## Open Persian LLM Alignment Leaderboard

	Developed by MCILAB in collaboration with the Machine Learning Laboratory at Sharif University of Technology, this benchmark is based on the open-source [ELAB](https://arxiv.org/pdf/2504.12553), which presents a comprehensive evaluation framework for assessing the alignment of Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms.
	Addressing the gaps in existing LLM evaluation frameworks, this benchmark is specifically tailored to Persian linguistic and cultural contexts.
	### The benchmark combines three types of Persian-language data:
	1. Translated datasets (adapted from established English benchmarks)
	2. Synthetically generated data (newly created for Persian LLMs)
	3. Naturally collected data (reflecting indigenous cultural nuances)

	## Key Datasets in the Benchmark
	> The benchmark integrates the following datasets to ensure a robust evaluation of Persian LLMs:
	>
	> Translated Datasets
	> - Anthropic-fa
	> - AdvBench-fa
	> - HarmBench-fa
	> - DecodingTrust-fa
	>
	> Newly Developed Persian Datasets
	> - ProhibiBench-fa: Evaluates harmful and prohibited content in Persian culture.
	> - SafeBench-fa: Assesses safety in generated outputs.
	> - FairBench-fa: Measures bias mitigation in Persian LLMs.
	> - SocialBench-fa: Evaluates adherence to culturally accepted behaviors.
	>
	> Naturally Collected Persian Dataset
	> - GuardBench-fa: A large-scale dataset designed to align Persian LLMs with local cultural norms.

	### A Unified Framework for Persian LLM Evaluation
	By combining these datasets, our work establishes a culturally grounded alignment evaluation framework, enabling systematic assessment across three key aspects:

	- Safety: Avoiding harmful or toxic content.
	- Fairness: Mitigating biases in model outputs.
	- Social Norms: Ensuring culturally appropriate behavior.


	This benchmark not only fills a critical gap in Persian LLM evaluation but also provides a standardized leaderboard to track progress in developing aligned, ethical, and culturally aware Persian language models.

	### Download Dataset
	The full dataset is not publicly accessible; however, you can download a sample of 1,500 entries [here](https://huggingface.co/datasets/MCILAB/1500_sampel/tree/main). The distribution of this sample is as follows:
	![Demo](https://huggingface.co/spaces/MCILAB/LLM_Alignment_Evaluation/resolve/main/chart.png)
	![Test Image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hf-logo.png)
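
A minimal sketch for fetching the sample linked above, assuming the `huggingface_hub` package is installed. The repository id comes from that link; the helper name `download_sample` and the default `local_dir` are illustrative and not part of the benchmark tooling.

```python
# Repository id taken from the sample link; everything else here is illustrative.
SAMPLE_REPO_ID = "MCILAB/1500_sampel"


def download_sample(local_dir="elab_sample"):
    # Imported lazily so this sketch can be defined without the package installed.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is needed because the sample is hosted as a dataset repo.
    return snapshot_download(
        repo_id=SAMPLE_REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
    )
```

Calling `download_sample()` should return the local path of the downloaded files; the file layout inside the repository is not described here, so inspect the folder before loading anything.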
	"""

	EVALUATION_QUEUE_TEXT = """
	## Some good practices before submitting a model

	### 1) Make sure you can load your model and tokenizer using AutoClasses:
	```python
	from transformers import AutoConfig, AutoModel, AutoTokenizer
	model_name = "your-username/your-model"  # placeholder: replace with your model's Hub id
	revision = "main"  # branch, tag, or commit hash you want evaluated
	config = AutoConfig.from_pretrained(model_name, revision=revision)
	model = AutoModel.from_pretrained(model_name, revision=revision)
	tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
	```
	If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

	Note: make sure your model is public!
	Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay tuned!

	### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
	Safetensors is a format for storing weights that is safer and faster to load than pickle-based checkpoints. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
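
One possible way to do the conversion is sketched below. It assumes a causal language model that loads with `AutoModelForCausalLM` and that `transformers`, `torch`, and `safetensors` are installed; the helper name `convert_to_safetensors` and its arguments are hypothetical, so adapt the model class to your architecture.

```python
def convert_to_safetensors(model_name, output_dir, revision="main"):
    # Imported lazily; this sketch assumes transformers, torch and safetensors are installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the checkpoint is a causal LM. Use another Auto class if it is not.
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)

    # safe_serialization=True writes *.safetensors files instead of pickle-based *.bin files.
    model.save_pretrained(output_dir, safe_serialization=True)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

After the call, `output_dir` should contain the safetensors weights alongside the config and tokenizer files, ready to be uploaded to your model repository.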

	### 3) Make sure your model has an open license!
	This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

	### 4) Fill out your model card
	When we add extra information about models to the leaderboard, it will be taken automatically from the model card.
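
The four checks above can be approximated with a small pre-submission script. This is only a sketch: the function names are hypothetical, the rules are a simplified reading of the steps in this section, and the Hub lookup assumes the `huggingface_hub` package is installed.

```python
SAFETENSORS_SUFFIX = ".safetensors"


def find_submission_problems(is_private, filenames, license_name):
    # Pure function: returns a list of human-readable problems (empty if none found).
    problems = []
    if is_private:
        problems.append("model is private")
    if not any(name.endswith(SAFETENSORS_SUFFIX) for name in filenames):
        problems.append("no safetensors weights found")
    if not license_name:
        problems.append("no license in model card")
    if "README.md" not in filenames:
        problems.append("no model card (README.md)")
    return problems


def check_hub_model(repo_id, revision="main"):
    # Imported lazily so the pure check above works without the package installed.
    from huggingface_hub import HfApi

    # Without an access token, a private or missing repo raises an error here instead.
    info = HfApi().model_info(repo_id, revision=revision)
    filenames = [sibling.rfilename for sibling in (info.siblings or [])]
    license_name = getattr(info.card_data, "license", None) if info.card_data else None
    return find_submission_problems(bool(info.private), filenames, license_name)
```

For example, `find_submission_problems(False, ["README.md", "model.safetensors"], "apache-2.0")` returns an empty list, while a repository with only `pytorch_model.bin` and no license would be reported with several problems.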

	## In case of model failure
	If your model is displayed in the `FAILED` category, its execution stopped.
	First, make sure you have followed the steps above.
	If you have, check that you can launch the EleutherAI harness on your model locally (you can add `--limit` to limit the number of examples per task).
	"""

	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
	CITATION_BUTTON_TEXT = """
	If you use this benchmark in your research, please cite it as follows:

	@article{ELAB,
	    title={Extensive LLM Alignment Benchmark in Persian Language},
	    author={Zahra Pourbahman and Fatemeh Rajabi and others},
	    year={2025},
	    url={https://arxiv.org/pdf/2504.12553}
	}

	Or in plain text:
	Zahra Pourbahman, Fatemeh Rajabi, et al. "Extensive LLM Alignment Benchmark in Persian Language" (ELAB), 2025.
	"""