Update README.md
README.md (changed)

- pytorch
library_name: transformers
---

<p align="center">
  <img src="./fanar_logo.jpg" width="200"/>
</p>

# Fanar-1-9B

**Fanar-1-9B** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. We continually pretrain the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Gulf, Levantine, and Egyptian. Fanar models, through meticulous curation of the pretraining and instruction-tuning data, are aligned with Islamic values and Arab cultures.

The [instruction-tuned version](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) of **Fanar-1-9B** is a core component of the [Fanar GenAI platform](https://fanar.qa/), which offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic speech recognition (ASR), attribution and fact-checking, and Islamic RAG, among several other features.

We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).

| Attribute | Value |
|---------------------------|------------------------------------|
| Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
| Sponsored by | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
| Model Type | Autoregressive Transformer |
| Parameter Count | 8.7 Billion |
| Context Length | 4096 Tokens |

## Model Training

#### Pretraining

Fanar-1-9B was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.

## Getting Started
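
A minimal way to load the base model and generate text with the `transformers` library is sketched below; the prompt and generation settings are illustrative rather than recommended defaults.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QCRI/Fanar-1-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base (non-instruct) model: plain text continuation rather than chat.
prompt = "الدوحة هي عاصمة"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```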

Fanar-1-9B is a base model and can be finetuned for a variety of use cases such as the following (a minimal fine-tuning sketch appears after the list):

- Conversational agents (Arabic only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
- Research on Arabic natural language generation and understanding
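
One common way to adapt the base model for such use cases is parameter-efficient fine-tuning; the sketch below uses LoRA adapters via the `peft` library. The toy dataset, adapter targets, and hyperparameters are illustrative assumptions, not settings from this card.

```python
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "QCRI/Fanar-1-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections (module names follow the
# Gemma-2 architecture that Fanar-1-9B is based on).
lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_cfg)

# Toy corpus standing in for a real instruction or domain dataset.
texts = [
    "سؤال: ما هي عاصمة قطر؟ جواب: الدوحة.",
    "Question: What is the capital of Qatar? Answer: Doha.",
]
train_ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fanar-1-9b-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, bf16=True, logging_steps=1),
    train_dataset=train_ds,
    # Causal-LM collator copies input_ids into labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("fanar-1-9b-lora")
```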

A finetuned version of Fanar-1-9B can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content.**

---

## Ethical Considerations & Limitations

Fanar-1-9B is capable of generating fluent and contextually appropriate responses. However, as with any generative model, there are uncertainties: it may produce **biased, offensive, or incorrect outputs**, and it is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar-1-9B and attempted to mitigate these issues, we cannot redress every possible scenario. We therefore advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.

---

Evaluation was conducted using a modified version of the LM Evaluation Harness.

| Model | MMLU (5-shot) | MMMLU (Arabic) (0-shot) | ArabicMMLU (3-shot) | HellaSwag (0-shot) | PIQA (0-shot) | ARC Challenge (0-shot) | Belebele (Arabic) (3-shot) | ACVA (5-shot) | GSM8k | OALL (0-shot) | OALL v2 (0-shot) | Almieyar Arabic (3-shot) | Arab Cultural MCQ (3-shot) | AraDiCE PIQA (MSA) (0-shot) | AraDiCE PIQA (Egy) (0-shot) | AraDiCE PIQA (Lev) (0-shot) | AraDiCE ArabicMMLU (Egy) (0-shot) | AraDiCE ArabicMMLU (Lev) (0-shot) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fanar-1-9B | 71.33% | **57.38%** | **67.42%** | **80.76%** | 81.66% | 59.73% | **79.31%** | **81.31%** | **45.79%** | **54.94%** | **63.20%** | **77.18%** | **72.30%** | **66.00%** | **62.19%** | 57.67% | **55.79%** | **55.63%** |
| AceGPT-v2-8B | 63.55% | 41.71% | 58.55% | 76.97% | 80.03% | 49.40% | 60.61% | 78.36% | 10.92% | 43.58% | - | 66.83% | 67.50% | 63.17% | 61.48% | 56.75% | 43.40% | 40.96% |
| gemma-2-9b | 70.60% | 54.04% | 64.32% | 79.82% | **82.97%** | **65.53%** | 75.31% | 79.66% | 21.61% | 50.24% | 57.23% | 73.82% | 68.60% | 63.98% | 60.17% | 58.05% | 49.61% | 47.15% |
| jais-adapted-13b | 50.42% | 34.01% | 51.96% | 78.02% | 78.94% | 48.55% | 43.02% | 73.52% | 5.76% | 40.79% | 40.06% | 62.34% | 60.90% | 65.02% | **62.19%** | **59.25%** | 38.24% | 37.93% |
| jais-family-6p7b | 32.50% | 25.34% | 34.81% | 69.28% | 75.95% | 40.27% | 34.54% | 60.13% | 3.87% | 37.55% | 33.59% | 32.17% | 34.00% | 65.18% | 60.23% | 58.38% | 28.50% | 29.46% |
| Llama-3.1-8B | 65.10% | 43.21% | 55.73% | 78.95% | 81.01% | 53.41% | 61.59% | 77.72% | 26.00% | 43.01% | 52.29% | 63.84% | 60.00% | 57.51% | 55.28% | 53.81% | 41.44% | 38.39% |
| Qwen2.5-7B | **74.18%** | 51.77% | 65.08% | 78.95% | 79.71% | 51.37% | 71.72% | 80.37% | 9.40% | 48.66% | 59.40% | 76.81% | 65.70% | 59.68% | 57.51% | 55.44% | 47.33% | 49.26% |
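
As noted above, the reported numbers come from a modified version of the LM Evaluation Harness. A roughly comparable run with the stock harness could be launched as sketched below; the task selection, few-shot settings, and batch size are illustrative and will not exactly reproduce the table.

```python
import lm_eval

# Evaluate the model on a few of the benchmarks listed above using the
# standard (unmodified) lm-evaluation-harness Python API.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=QCRI/Fanar-1-9B,dtype=bfloat16",
    tasks=["hellaswag", "piqa", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```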

## Acknowledgements

This project is from [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.

Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support and for providing the compute infrastructure through the Google Cloud Platform.

---