shamz15531 committed on
Commit b8e78b0 · verified · 1 Parent(s): 4a82767

Update README.md

Files changed (1):
  1. README.md +19 -13
README.md CHANGED
@@ -8,11 +8,16 @@ tags:
 - pytorch
 library_name: transformers
 ---
+
+<p align="center">
+  <img src="./fanar_logo.jpg" width="200"/>
+</p>
+
 # Fanar-1-9B
 
-**Fanar-1-9B** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. We continually pretrain the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects including Levantine and Egyptian. Fanar-1-9B, through meticulous curation, filtering, and sampling of the pretraining data, is aligned with Islamic values and Arab cultures.
+**Fanar-1-9B** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. We continually pretrain the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Gulf, Levantine, and Egyptian. Fanar models, through meticulous curation of the pretraining and instruction-tuning data, are aligned with Islamic values and Arab cultures.
 
-The [instruction-tuned version](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) of **Fanar-1-9B** is a core component within the [Fanar GenAI platform](https://fanar.qa/) that offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic-speech-recognition (ASR), attribution and fact-checking, Islamic RAG, among several other features.
+The [instruction-tuned version](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) of **Fanar-1-9B** is a core component of the [Fanar GenAI platform](https://fanar.qa/) that offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic-speech-recognition (ASR), attribution and fact-checking, Islamic RAG, among several other features.
 
 We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).
 
@@ -23,7 +28,7 @@ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) wit
 | Attribute | Value |
 |---------------------------|------------------------------------|
 | Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
-| Sponsored by | [Minisitry of Communications and Technology, State of Qatar](https://www.mcit.gov.qa/en/)
+| Sponsored by | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
 | Model Type | Autoregressive Transformer |
 | Parameter Count | 8.7 Billion |
 | Context Length | 4096 Tokens |
@@ -41,7 +46,7 @@ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) wit
 ## Model Training
 
 #### Pretraining
-Fanar-1-9B was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and flitered from a variety of sources, 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
+Fanar-1-9B was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
 
 ## Getting Started
 
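The Getting Started code itself is unchanged by this commit; the next hunk header shows its closing line (`print(tokenizer.decode(outputs[0], skip_special_tokens=True))`), i.e., a standard `transformers` generate-and-decode flow. A minimal sketch of such a snippet, with the repo id and generation settings as assumptions rather than text from the README:

```python
# Minimal generate-and-decode sketch for Fanar-1-9B. The repo id, dtype, and
# generation settings below are assumptions, not taken from this commit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QCRI/Fanar-1-9B"  # assumed id, by analogy with QCRI/Fanar-1-9B-Instruct

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~18 GB of weights for a 9B model in bf16
    device_map="auto",
)

prompt = "Doha is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```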
@@ -69,21 +74,22 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
 Fanar-1-9B is a base model and can be finetuned for a variety of use cases such as:
 
-- Research on Arabic natural language generation and understanding
 - Conversational agents (Arabic only or bilingual)
 - Cultural and dialectal question answering in Arabic
-- Educational, governmental, and civic NLP applications focussed on the Arab world or Arabic-speaking audiences
+- Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
+- Research on Arabic natural language generation and understanding
 
-A finetuned version of Fanar-1-9B can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content**
+A finetuned version of Fanar-1-9B can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content.**
 
 
 ---
 
 ## Ethical Considerations & Limitations
 
-Fanar-1-9B is capable of generating fluent and contextually appropriate responses, but as with any generative model there are uncertainities. The model may produce **biased, offensive, or incorrect outputs**. The model is **not suitable for high-stakes decision making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar-1-9B and attempted to mitigate these issues, we cannot redress every possible scenario. Thus, we advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service]( https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).
+Fanar-1-9B is capable of generating fluent and contextually appropriate responses. However, as with any generative model, there are uncertainties. The model may produce **biased, offensive, or incorrect outputs**. The model is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar-1-9B and attempted to mitigate these issues, we cannot redress every possible scenario. Thus, we advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).
 
 The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT or any other organization or individual.
+
 
 ---
 
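The model card does not prescribe a finetuning recipe for the use cases listed in the hunk above; one common lightweight route for a 9B base model is parameter-efficient finetuning. A sketch using the `peft` library, where the rank, scaling factor, and target modules are illustrative assumptions, not QCRI's settings:

```python
# Hypothetical LoRA setup for finetuning Fanar-1-9B with Hugging Face peft.
# Every hyperparameter here is an illustrative assumption, not QCRI's recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("QCRI/Fanar-1-9B")  # assumed repo id
lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections in Gemma-2-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices train
```

Training only the adapter weights keeps memory requirements far below full finetuning of all 8.7B parameters.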
@@ -95,13 +101,13 @@ Evaluation was conducted using a modified version of the LM Evaluation Harness a
 
 | Model | MMLU (5-shot) | MMMLU (Arabic) (0-shot) | ArabicMMLU (3-shot) | HellaSwag (0-shot) | PIQA (0-shot) | ARC Challenge (0-shot) | Belebele (Arabic) (3-shot) | ACVA (5-shot) | GSM8k | OALL (0-shot) | OALL v2 (0-shot) | Almieyar Arabic (3-shot) | Arab Cultural MCQ (3-shot) | AraDiCE PIQA (MSA) (0-shot) | AraDiCE PIQA (Egy) (0-shot) | AraDiCE PIQA (Lev) (0-shot) | AraDiCE ArabicMMLU (Egy) (0-shot) | AraDiCE ArabicMMLU (Lev) (0-shot) |
 |-------|----------------|--------------------------|----------------------|--------------------|---------------|-------------------------|------------------------------|---------------|--------|----------------|------------------|---------------------------|-----------------------------|-------------------------------|------------------------------|------------------------------|-----------------------------------|-----------------------------------|
-| Fanar-1-9B | 71.33% | 57.38% | 67.42% | 80.76% | 81.66% | 59.73% | 79.31% | 81.31% | 45.79% | 54.94% | 63.20% | 77.18% | 72.30% | 66.00% | 62.19% | 57.67% | 55.79% | 55.63% |
+| Fanar-1-9B | 71.33% | **57.38%** | **67.42%** | **80.76%** | 81.66% | 59.73% | **79.31%** | **81.31%** | **45.79%** | **54.94%** | **63.20%** | **77.18%** | **72.30%** | **66.00%** | **62.19%** | 57.67% | **55.79%** | **55.63%** |
 | AceGPT-v2-8B | 63.55% | 41.71% | 58.55% | 76.97% | 80.03% | 49.40% | 60.61% | 78.36% | 10.92% | 43.58% | - | 66.83% | 67.50% | 63.17% | 61.48% | 56.75% | 43.40% | 40.96% |
-| gemma-2-9b | 70.60% | 54.04% | 64.32% | 79.82% | 82.97% | 65.53% | 75.31% | 79.66% | 21.61% | 50.24% | 57.23% | 73.82% | 68.60% | 63.98% | 60.17% | 58.05% | 49.61% | 47.15% |
+| gemma-2-9b | 70.60% | 54.04% | 64.32% | 79.82% | **82.97%** | **65.53%** | 75.31% | 79.66% | 21.61% | 50.24% | 57.23% | 73.82% | 68.60% | 63.98% | 60.17% | 58.05% | 49.61% | 47.15% |
-| jais-adapted-13b | 50.42% | 34.01% | 51.96% | 78.02% | 78.94% | 48.55% | 43.02% | 73.52% | 5.76% | 40.79% | 40.06% | 62.34% | 60.90% | 65.02% | 62.19% | 59.25% | 38.24% | 37.93% |
+| jais-adapted-13b | 50.42% | 34.01% | 51.96% | 78.02% | 78.94% | 48.55% | 43.02% | 73.52% | 5.76% | 40.79% | 40.06% | 62.34% | 60.90% | 65.02% | **62.19%** | **59.25%** | 38.24% | 37.93% |
 | jais-family-6p7b | 32.50% | 25.34% | 34.81% | 69.28% | 75.95% | 40.27% | 34.54% | 60.13% | 3.87% | 37.55% | 33.59% | 32.17% | 34.00% | 65.18% | 60.23% | 58.38% | 28.50% | 29.46% |
 | Llama-3.1-8B | 65.10% | 43.21% | 55.73% | 78.95% | 81.01% | 53.41% | 61.59% | 77.72% | 26.00% | 43.01% | 52.29% | 63.84% | 60.00% | 57.51% | 55.28% | 53.81% | 41.44% | 38.39% |
-| Qwen2.5-7B | 74.18% | 51.77% | 65.08% | 78.95% | 79.71% | 51.37% | 71.72% | 80.37% | 9.40% | 48.66% | 59.40% | 76.81% | 65.70% | 59.68% | 57.51% | 55.44% | 47.33% | 49.26% |
+| Qwen2.5-7B | **74.18%** | 51.77% | 65.08% | 78.95% | 79.71% | 51.37% | 71.72% | 80.37% | 9.40% | 48.66% | 59.40% | 76.81% | 65.70% | 59.68% | 57.51% | 55.44% | 47.33% | 49.26% |
 
 </div>
 
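On reproducing rows of this table: the hunk header notes that evaluation used a modified version of the LM Evaluation Harness, so the stock harness will not match these numbers exactly. A sketch of the nearest stock invocation through the `lm_eval` Python API (v0.4+), with the repo id and task selection as assumptions:

```python
# Approximate reproduction sketch with the stock LM Evaluation Harness
# (pip install lm-eval). The README used a *modified* harness, so expect
# deviations from the table; model id and task choice are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                              # transformers backend
    model_args="pretrained=QCRI/Fanar-1-9B,dtype=bfloat16",  # assumed repo id
    tasks=["mmlu"],                                          # table reports 5-shot MMLU
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # per-task accuracy dictionary
```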
@@ -126,7 +132,7 @@ If you use Fanar-1-9B or [Fanar-1-9B-Instruct](https://huggingface.co/QCRI/Fanar
 ## Acknowledgements
 
 This project is from [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
-Special thanks to the [Minisitry of Communications and Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure through the Google Cloud Platform.
+Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure through the Google Cloud Platform.
 
 
 ---