Update src/about.py
src/about.py CHANGED (+9 -6)
@@ -41,26 +41,29 @@ Addressing the gaps in existing LLM evaluation frameworks, this benchmark is spe
 2. Synthetically generated data (newly created for Persian LLMs)
 3. Naturally collected data (reflecting indigenous cultural nuances)

-
+## Key Datasets in the Benchmark
 > The benchmark integrates the following datasets to ensure a robust evaluation of Persian LLMs:
+>
 > **Translated Datasets**
 > • Anthropic-fa
 > • AdvBench-fa
->
+> • HarmBench-fa
 > • DecodingTrust-fa
+>
 > **Newly Developed Persian Datasets**
 > • ProhibiBench-fa: Evaluates harmful and prohibited content in Persian culture.
 > • SafeBench-fa: Assesses safety in generated outputs.
 > • FairBench-fa: Measures bias mitigation in Persian LLMs.
 > • SocialBench-fa: Evaluates adherence to culturally accepted behaviors.
+>
 > **Naturally Collected Persian Dataset**
 > • GuardBench-fa: A large-scale dataset designed to align Persian LLMs with local cultural norms.

 ### A Unified Framework for Persian LLM Evaluation
-
-
-
-
+By combining these datasets, our work establishes a culturally grounded alignment evaluation framework, enabling systematic assessment across three key aspects:
+• **Safety**: Avoiding harmful or toxic content.
+• **Fairness**: Mitigating biases in model outputs.
+• **Social Norms**: Ensuring culturally appropriate behavior.


 This benchmark not only fills a critical gap in Persian LLM evaluation but also provides a standardized leaderboard to track progress in developing aligned, ethical, and culturally aware Persian language models.
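
The dataset grouping described in the changed text could be written down as a small lookup table. The sketch below is illustrative only: the names `DATASETS_BY_SOURCE`, `source_of`, and `all_datasets` are hypothetical and do not come from `src/about.py`; only the dataset names and the three source groups are taken from the diff.

```python
# Hypothetical registry mirroring the three source groups named in the text.
# The dictionary keys and the helper names are assumptions for illustration.
DATASETS_BY_SOURCE = {
    "translated": ["Anthropic-fa", "AdvBench-fa", "HarmBench-fa", "DecodingTrust-fa"],
    "synthetic": ["ProhibiBench-fa", "SafeBench-fa", "FairBench-fa", "SocialBench-fa"],
    "natural": ["GuardBench-fa"],
}


def source_of(dataset: str) -> str:
    """Return the source group a dataset belongs to, or raise KeyError."""
    for source, names in DATASETS_BY_SOURCE.items():
        if dataset in names:
            return source
    raise KeyError(dataset)


def all_datasets() -> list[str]:
    """Flatten the registry into a single list, preserving group order."""
    return [name for names in DATASETS_BY_SOURCE.values() for name in names]
```

A lookup such as `source_of("HarmBench-fa")` then places the dataset added in this commit in the translated group.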