Update README.md
README.md (changed)

- pytorch
library_name: transformers
---

<p align="center">
  <img src="./fanar_logo.jpg" width="200"/>
</p>

# Fanar-1-9B

**Fanar-1-9B** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. We continually pretrain the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Gulf, Levantine, and Egyptian. Fanar models, through meticulous curation of the pretraining and instruction-tuning data, are aligned with Islamic values and Arab cultures.

The [instruction-tuned version](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) of **Fanar-1-9B** is a core component of the [Fanar GenAI platform](https://fanar.qa/), which offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic speech recognition (ASR), attribution and fact-checking, and Islamic RAG, among several other features.

We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).

| Attribute | Value |
|---------------------------|------------------------------------|
| Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
| Sponsored by | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
| Model Type | Autoregressive Transformer |
| Parameter Count | 8.7 Billion |
| Context Length | 4096 Tokens |

## Model Training

#### Pretraining

Fanar-1-9B was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.

## Getting Started
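
A minimal way to load the base model and generate text with the `transformers` library is sketched below; the prompt and generation settings are illustrative rather than recommended defaults.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QCRI/Fanar-1-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base (non-instruct) model: plain text continuation rather than chat.
prompt = "الدوحة هي عاصمة"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```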

Fanar-1-9B is a base model and can be finetuned for a variety of use cases such as the following (a minimal fine-tuning sketch appears after the list):

- Conversational agents (Arabic only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
- Research on Arabic natural language generation and understanding
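
One common way to adapt the base model for such use cases is parameter-efficient fine-tuning; the sketch below uses LoRA adapters via the `peft` library. The toy dataset, adapter targets, and hyperparameters are illustrative assumptions, not settings from this card.

```python
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "QCRI/Fanar-1-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections (module names follow the
# Gemma-2 architecture that Fanar-1-9B is based on).
lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_cfg)

# Toy corpus standing in for a real instruction or domain dataset.
texts = [
    "سؤال: ما هي عاصمة قطر؟ جواب: الدوحة.",
    "Question: What is the capital of Qatar? Answer: Doha.",
]
train_ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fanar-1-9b-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, bf16=True, logging_steps=1),
    train_dataset=train_ds,
    # Causal-LM collator copies input_ids into labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("fanar-1-9b-lora")
```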

A finetuned version of Fanar-1-9B can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content.**

---

## Ethical Considerations & Limitations

Fanar-1-9B is capable of generating fluent and contextually appropriate responses. However, as with any generative model, there are uncertainties: it may produce **biased, offensive, or incorrect outputs**, and it is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar-1-9B and attempted to mitigate these issues, we cannot redress every possible scenario. We therefore advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.

---

Evaluation was conducted using a modified version of the LM Evaluation Harness.

| Model | MMLU (5-shot) | MMMLU (Arabic) (0-shot) | ArabicMMLU (3-shot) | HellaSwag (0-shot) | PIQA (0-shot) | ARC Challenge (0-shot) | Belebele (Arabic) (3-shot) | ACVA (5-shot) | GSM8k | OALL (0-shot) | OALL v2 (0-shot) | Almieyar Arabic (3-shot) | Arab Cultural MCQ (3-shot) | AraDiCE PIQA (MSA) (0-shot) | AraDiCE PIQA (Egy) (0-shot) | AraDiCE PIQA (Lev) (0-shot) | AraDiCE ArabicMMLU (Egy) (0-shot) | AraDiCE ArabicMMLU (Lev) (0-shot) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fanar-1-9B | 71.33% | **57.38%** | **67.42%** | **80.76%** | 81.66% | 59.73% | **79.31%** | **81.31%** | **45.79%** | **54.94%** | **63.20%** | **77.18%** | **72.30%** | **66.00%** | **62.19%** | 57.67% | **55.79%** | **55.63%** |
| AceGPT-v2-8B | 63.55% | 41.71% | 58.55% | 76.97% | 80.03% | 49.40% | 60.61% | 78.36% | 10.92% | 43.58% | - | 66.83% | 67.50% | 63.17% | 61.48% | 56.75% | 43.40% | 40.96% |
| gemma-2-9b | 70.60% | 54.04% | 64.32% | 79.82% | **82.97%** | **65.53%** | 75.31% | 79.66% | 21.61% | 50.24% | 57.23% | 73.82% | 68.60% | 63.98% | 60.17% | 58.05% | 49.61% | 47.15% |
| jais-adapted-13b | 50.42% | 34.01% | 51.96% | 78.02% | 78.94% | 48.55% | 43.02% | 73.52% | 5.76% | 40.79% | 40.06% | 62.34% | 60.90% | 65.02% | **62.19%** | **59.25%** | 38.24% | 37.93% |
| jais-family-6p7b | 32.50% | 25.34% | 34.81% | 69.28% | 75.95% | 40.27% | 34.54% | 60.13% | 3.87% | 37.55% | 33.59% | 32.17% | 34.00% | 65.18% | 60.23% | 58.38% | 28.50% | 29.46% |
| Llama-3.1-8B | 65.10% | 43.21% | 55.73% | 78.95% | 81.01% | 53.41% | 61.59% | 77.72% | 26.00% | 43.01% | 52.29% | 63.84% | 60.00% | 57.51% | 55.28% | 53.81% | 41.44% | 38.39% |
| Qwen2.5-7B | **74.18%** | 51.77% | 65.08% | 78.95% | 79.71% | 51.37% | 71.72% | 80.37% | 9.40% | 48.66% | 59.40% | 76.81% | 65.70% | 59.68% | 57.51% | 55.44% | 47.33% | 49.26% |
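
As noted above, the reported numbers come from a modified version of the LM Evaluation Harness. A roughly comparable run with the stock harness could be launched as sketched below; the task selection, few-shot settings, and batch size are illustrative and will not exactly reproduce the table.

```python
import lm_eval

# Evaluate the model on a few of the benchmarks listed above using the
# standard (unmodified) lm-evaluation-harness Python API.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=QCRI/Fanar-1-9B,dtype=bfloat16",
    tasks=["hellaswag", "piqa", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```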

## Acknowledgements

This project is from [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.

Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support and for providing the compute infrastructure through the Google Cloud Platform.

---