loaiabdalslam committed on
Commit a2c6ba7 · verified · 1 Parent(s): 6697605

Update README.md

Files changed (1)
  1. README.md +220 -31
README.md CHANGED
@@ -12,53 +12,242 @@ tags:
12
  - instruction-following
13
  - text-generation
14
  - merged_16bit
15
- - llama-cpp
16
- - gguf-my-repo
17
- base_model: beetlware/Bee1reason-arabic-Qwen-14B
18
  datasets:
19
  - beetlware/arabic-reasoning-dataset-logic
20
  ---
21
 
22
- # loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF
23
- This model was converted to GGUF format from [`beetlware/Bee1reason-arabic-Qwen-14B`](https://huggingface.co/beetlware/Bee1reason-arabic-Qwen-14B) using llama.cpp via the ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
24
- Refer to the [original model card](https://huggingface.co/beetlware/Bee1reason-arabic-Qwen-14B) for more details on the model.
25
 
26
- ## Use with llama.cpp
27
- Install llama.cpp through brew (works on Mac and Linux)
28
 
29
- ```bash
30
- brew install llama.cpp
31
 
32
- ```
33
- Invoke the llama.cpp server or the CLI.
34
 
35
- ### CLI:
36
- ```bash
37
- llama-cli --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -p "The meaning to life and the universe is"
38
- ```
39
 
40
- ### Server:
41
- ```bash
42
- llama-server --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -c 2048
43
- ```
44
 
45
- Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo as well.
46
 
47
- Step 1: Clone llama.cpp from GitHub.
48
- ```
49
- git clone https://github.com/ggerganov/llama.cpp
50
- ```
51
 
52
- Step 2: Move into the llama.cpp folder and build it with `LLAMA_CURL=1` flag along with other hardware-specific flags (for ex: LLAMA_CUDA=1 for Nvidia GPUs on Linux).
53
  ```
54
- cd llama.cpp && LLAMA_CURL=1 make
55
  ```
56
 
57
- Step 3: Run inference through the main binary.
58
  ```
59
- ./llama-cli --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -p "The meaning to life and the universe is"
60
  ```
61
- or
62
  ```
63
- ./llama-server --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -c 2048
64
  ```
12
  - instruction-following
13
  - text-generation
14
  - merged_16bit
15
+ base_model: unsloth/Qwen3-14B
16
  datasets:
17
  - beetlware/arabic-reasoning-dataset-logic
18
  ---
19
 
20
+ # Bee1reason-arabic-Qwen-14B: A Qwen3 14B Model Fine-tuned for Arabic Logical Reasoning
21
 
22
+ ## Model Overview
 
23
 
24
+ **Bee1reason-arabic-Qwen-14B** is a Large Language Model (LLM) fine-tuned from the `unsloth/Qwen3-14B` base model. It has been specifically tailored to strengthen logical and deductive reasoning in Arabic while maintaining its general conversational abilities. Fine-tuning used LoRA (Low-Rank Adaptation) with the [Unsloth](https://github.com/unslothai/unsloth) library for high training efficiency, and the LoRA weights were then merged with the base model to produce this standalone 16-bit (float16) precision model.
 
25
 
26
+ **Key Features:**
27
+ * **Built on `unsloth/Qwen3-14B`:** Leverages the power and performance of the Qwen3 14-billion parameter base model.
28
+ * **Fine-tuned for Arabic Logical Reasoning:** Trained on a dataset containing Arabic logical reasoning tasks.
29
+ * **Conversational Format:** The model follows a conversational format, expecting user and assistant roles. It was trained on data that may include "thinking steps" (often within `<think>...</think>` tags) before providing the final answer, which is beneficial for tasks requiring explanation or complex inference.
30
+ * **Unsloth Efficiency:** The Unsloth library was used for the fine-tuning process, enabling faster training and reduced GPU memory consumption.
31
+ * **Merged 16-bit Model:** The final weights are a full float16 precision model, ready for direct use without needing to apply LoRA adapters to a separate base model.
32
 
33
+ ## Training Data
34
 
35
+ The model was primarily fine-tuned on a custom Arabic logical reasoning dataset, `beetlware/arabic-reasoning-dataset-logic`, available on the Hugging Face Hub. The dataset covers several types of reasoning (deduction, induction, and abduction); each task comprises the question text, a proposed answer, and a detailed solution including thinking steps.
36
 
37
+ This data was converted into a conversational format for training (see the sketch below), typically with:
38
+ 1. **User Role:** Containing the problem/question text.
39
+ 2. **Assistant Role:** Containing the detailed solution, including thinking steps (often within `<think>...</think>` tags) followed by the final answer.
40
 
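+ As a rough illustration, here is a minimal sketch of that conversion using the standard `datasets` API. The field names (`question`, `solution`, `answer`) are hypothetical and may differ from the actual column names in the dataset:
+
+ ```python
+ # Minimal sketch of the conversion described above (field names are assumptions).
+ from datasets import load_dataset
+
+ def to_conversation(example):
+     # Wrap the detailed solution in <think>...</think>, then append the final answer.
+     assistant_reply = f"<think>{example['solution']}</think>\n{example['answer']}"
+     return {
+         "conversations": [
+             {"role": "user", "content": example["question"]},
+             {"role": "assistant", "content": assistant_reply},
+         ]
+     }
+
+ dataset = load_dataset("beetlware/arabic-reasoning-dataset-logic", split="train")
+ dataset = dataset.map(to_conversation)
+ ```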
41
+ ## Fine-tuning Details
42
+
43
+ * **Base Model:** `unsloth/Qwen3-14B`
44
+ * **Fine-tuning Technique:** LoRA (Low-Rank Adaptation)
45
+ * `r` (rank): 32
46
+ * `lora_alpha`: 32
47
+ * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
48
+ * `lora_dropout`: 0
49
+ * `bias`: "none"
50
+ * **Libraries Used:** Unsloth (for efficient model loading and PEFT application) and Hugging Face TRL (`SFTTrainer`)
51
+ * **Max Sequence Length (`max_seq_length`):** 2048 tokens
52
+ * **Training Parameters (example from notebook):**
53
+ * `per_device_train_batch_size`: 2
54
+ * `gradient_accumulation_steps`: 4 (simulating a total batch size of 8)
55
+ * `warmup_steps`: 5
56
+ * `max_steps`: 30 (in the notebook, adjustable for a full run)
57
+ * `learning_rate`: 2e-4 (recommended to reduce to 2e-5 for longer training runs)
58
+ * `optim`: "adamw_8bit"
59
+ * **Final Save:** LoRA weights were merged with the base model and saved in `merged_16bit` (float16) precision; a configuration sketch reflecting these settings is shown below.
60
+
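+ The following is a minimal, non-authoritative sketch of what a LoRA setup with these hyperparameters might look like using Unsloth and TRL. It is not the authors' exact training script; dataset preparation is simplified, and `dataset` is assumed to be the conversational dataset from the previous section, already rendered to a `"text"` column with the chat template.
+
+ ```python
+ # Sketch only: mirrors the hyperparameters listed above (Unsloth + TRL).
+ from unsloth import FastLanguageModel
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Qwen3-14B",
+     max_seq_length=2048,
+     load_in_4bit=True,  # assumption: 4-bit loading during training to save memory
+ )
+
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=32,
+     lora_alpha=32,
+     lora_dropout=0,
+     bias="none",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     train_dataset=dataset,          # formatted conversational dataset (see previous section)
+     dataset_text_field="text",
+     max_seq_length=2048,
+     args=TrainingArguments(
+         per_device_train_batch_size=2,
+         gradient_accumulation_steps=4,
+         warmup_steps=5,
+         max_steps=30,
+         learning_rate=2e-4,
+         optim="adamw_8bit",
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+
+ # Merge the LoRA weights into the base model and save in float16.
+ model.save_pretrained_merged("Bee1reason-arabic-Qwen-14B", tokenizer, save_method="merged_16bit")
+ ```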
61
+ ## How to Use (with Transformers)
62
+
63
+ Since this is a merged 16-bit model, you can load and use it directly with the `transformers` library:
64
+
65
+ ```python
66
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
67
+ import torch
68
+
69
+ model_id = "beetlware/Bee1reason-arabic-Qwen-14B"
70
+
71
+ # Load the Tokenizer
72
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
73
 
74
+ # Load the Model
75
+ model = AutoModelForCausalLM.from_pretrained(
76
+     model_id,
77
+     torch_dtype=torch.bfloat16,  # or torch.float16 if bfloat16 is not supported
78
+     device_map="auto",  # distributes the model across available devices (GPU/CPU)
79
+ )
80
+
81
+ # Ensure the model is in evaluation mode for inference
82
+ model.eval()
83
  ```
84
+ ### Example: Inference with Thinking Steps
+ Qwen3 models use special `<think>...</think>` tags for their reasoning steps. To encourage thinking mode during inference, craft the prompt so that it asks the model to reason step by step; Unsloth-trained Qwen3 models also often respond to the `enable_thinking` argument of `tokenizer.apply_chat_template`. For this merged model, whether `<think>` blocks actually appear depends on the training data.
90
+ ```python
91
+ user_prompt_with_thinking_request = "استخدم التفكير المنطقي خطوة بخطوة: إذا كان لدي 4 تفاحات والشجرة فيها 20 تفاحة، فكم تفاحة لدي إجمالاً؟" # "Use step-by-step logical thinking: If I have 4 apples and the tree has 20 apples, how many apples do I have in total?"
92
+
93
+ messages_with_thinking = [
94
+     {"role": "user", "content": user_prompt_with_thinking_request}
95
+ ]
96
+
97
+ # Apply the chat template
98
+ # Qwen3 uses a specific chat template. tokenizer.apply_chat_template is the correct way to format it.
99
+ chat_prompt_with_thinking = tokenizer.apply_chat_template(
100
+     messages_with_thinking,
101
+     tokenize=False,
102
+     add_generation_prompt=True  # important for adding the assistant's generation prompt
103
+ )
104
+
105
+ inputs_with_thinking = tokenizer(chat_prompt_with_thinking, return_tensors="pt").to(model.device)
106
+
107
+ print("\n--- Inference with Thinking Request (Example) ---")
108
+ streamer_think = TextStreamer(tokenizer, skip_prompt=True)
109
+ with torch.no_grad():  # important: disable gradient tracking during inference
110
+     outputs_think = model.generate(
111
+         **inputs_with_thinking,
112
+         max_new_tokens=512,
113
+         temperature=0.6,  # settings recommended by the Qwen team for reasoning
114
+         top_p=0.95,
115
+         top_k=20,
116
+         pad_token_id=tokenizer.eos_token_id,
117
+         streamer=streamer_think,
118
+     )
119
  ```
120
 
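+ If you want to separate the model's reasoning trace from its final answer, a minimal post-processing sketch is shown below. It assumes the reasoning, when present, is wrapped in `<think>...</think>` tags as described above; the helper name `split_think` is purely illustrative.
+
+ ```python
+ import re
+
+ def split_think(response_text: str):
+     """Split a decoded response into (thinking_steps, final_answer).
+
+     Assumes any reasoning is wrapped in <think>...</think> tags."""
+     match = re.search(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
+     if match is None:
+         return "", response_text.strip()
+     thinking = match.group(1).strip()
+     answer = response_text[match.end():].strip()
+     return thinking, answer
+
+ # Example (response_text is the decoded model output from the generation above):
+ # thinking, answer = split_think(response_text)
+ ```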
121
+ ```python
122
+ # --- Example for Normal Inference (Conversation without explicit thinking request) ---
123
+ user_prompt_normal = "ما هي عاصمة مصر؟" # "What is the capital of Egypt?"
124
+ messages_normal = [
125
+     {"role": "user", "content": user_prompt_normal}
126
+ ]
127
+
128
+ chat_prompt_normal = tokenizer.apply_chat_template(
129
+     messages_normal,
130
+     tokenize=False,
131
+     add_generation_prompt=True
132
+ )
133
+ inputs_normal = tokenizer(chat_prompt_normal, return_tensors="pt").to(model.device)
134
+
135
+ print("\n\n--- Normal Inference (Example) ---")
136
+ streamer_normal = TextStreamer(tokenizer, skip_prompt=True)
137
+ with torch.no_grad():
138
+     outputs_normal = model.generate(
139
+         **inputs_normal,
140
+         max_new_tokens=100,
141
+         temperature=0.7,  # settings recommended for normal chat
142
+         top_p=0.8,
143
+         top_k=20,
144
+         pad_token_id=tokenizer.eos_token_id,
145
+         streamer=streamer_normal,
146
+     )
147
  ```
148
+
149
+
150
+ ## Usage with VLLM (for High-Throughput Scaled Inference)
151
+ VLLM is a library for fast, high-throughput LLM inference. Because this model is saved as a merged 16-bit checkpoint, it can be served with VLLM directly.
152
+
153
+ 1. Install VLLM:
154
+
155
+ ```bash
156
+
157
+ pip install vllm
158
  ```
159
+ (VLLM installation might have specific CUDA and PyTorch version requirements. Refer to the VLLM documentation for the latest installation prerequisites.)
160
+
161
+ 2. Run the VLLM OpenAI-Compatible Server:
162
+ You can serve the model using VLLM's OpenAI-compatible API server, making it easy to integrate into existing applications.
163
+
164
+ ```bash
165
+ python -m vllm.entrypoints.openai.api_server \
166
+     --model beetlware/Bee1reason-arabic-Qwen-14B \
167
+     --tokenizer beetlware/Bee1reason-arabic-Qwen-14B \
168
+     --dtype bfloat16 \
169
+     --max-model-len 2048
170
+     # Optional: --tensor-parallel-size N    (if you have multiple GPUs)
171
+     # Optional: --gpu-memory-utilization 0.9    (to adjust GPU memory usage)
173
  ```
174
+ - Replace `--dtype bfloat16` with `--dtype float16` if bfloat16 is not supported on your hardware.
175
+ - `--max-model-len` should match the `max_seq_length` used during fine-tuning (2048).
176
+
177
+ 3. Send Requests to the VLLM Server:
178
+ Once the server is running (typically on http://localhost:8000), you can send requests from any OpenAI-compatible client, such as the `openai` library:
179
+ ```python
180
+
181
+ import openai
182
+
183
+ client = openai.OpenAI(
184
+     base_url="http://localhost:8000/v1",  # VLLM server address
185
+     api_key="dummy_key",  # VLLM does not require a real API key by default
186
+ )
187
+
188
+ completion = client.chat.completions.create(
189
+     model="beetlware/Bee1reason-arabic-Qwen-14B",  # model name as served by VLLM
190
+     messages=[
191
+         {"role": "user", "content": "اشرح نظرية النسبية العامة بكلمات بسيطة."}  # "Explain the theory of general relativity in simple terms."
192
+     ],
193
+     max_tokens=256,
194
+     temperature=0.7,
195
+     stream=True  # enable streaming
196
+ )
197
+
198
+ print("Streaming response from VLLM:")
199
+ full_response = ""
200
+ for chunk in completion:
201
+     if chunk.choices[0].delta.content is not None:
202
+         token = chunk.choices[0].delta.content
203
+         print(token, end="", flush=True)
204
+         full_response += token
205
+ print("\n--- End of stream ---")
206
+
207
  ```
208
+
209
+
210
+ ## Limitations and Potential Biases
211
+ - The model's performance is highly dependent on the quality and diversity of the training data, and it may exhibit biases present in that data.
212
+ - Despite fine-tuning for logical reasoning, the model can still make errors on very complex or unfamiliar reasoning tasks.
213
+ - The model may "hallucinate" or produce incorrect information, especially for topics not well covered in its training data.
214
+ - Capabilities in languages other than Arabic may be limited, since training focused primarily on Arabic.
215
+
216
+
217
+ ## Additional Information
218
+ - Developed by: loai abdalslam (Beetleware)
219
+ - Upload/Release Date: 21-5-2025
220
+ - Contact / Issue Reporting: [email protected]
221
+
222
+ ## Beetleware
223
+
224
+
225
+ We are a software house and digital transformation service provider that was founded six years ago and is based in Saudi Arabia.
226
+
227
+ All rights reserved © 2025
228
+
229
+ Our Offices
+
+ - KSA Office: (+966) 54 597 3282
+ - Egypt Office: (+2) 010 67 256 306
+ - Oman Office: (+968) 9522 8632
241
+
242
+
243
+
244
+
245
+ ## Uploaded model
246
+
247
+ - **Developed by:** beetlware AI Team
248
+ - **License:** apache-2.0
249
+ - **Finetuned from model:** unsloth/qwen3-14b-unsloth-bnb-4bit
250
+
251
+ This Qwen3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
252
+
253
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)