loaiabdalslam committed on
Commit a2c6ba7 · verified · 1 Parent(s): 6697605

Update README.md

Files changed (1)
  1. README.md +220 -31
README.md CHANGED
@@ -12,53 +12,242 @@ tags:
12
  - instruction-following
13
  - text-generation
14
  - merged_16bit
15
- - llama-cpp
16
- - gguf-my-repo
17
- base_model: beetlware/Bee1reason-arabic-Qwen-14B
18
  datasets:
19
  - beetlware/arabic-reasoning-dataset-logic
20
  ---
21
 
22
- # loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF
23
- This model was converted to GGUF format from [`beetlware/Bee1reason-arabic-Qwen-14B`](https://huggingface.co/beetlware/Bee1reason-arabic-Qwen-14B) using llama.cpp via the ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
24
- Refer to the [original model card](https://huggingface.co/beetlware/Bee1reason-arabic-Qwen-14B) for more details on the model.
25
 
26
- ## Use with llama.cpp
27
- Install llama.cpp through brew (works on Mac and Linux)
28
 
29
- ```bash
30
- brew install llama.cpp
31
 
32
- ```
33
- Invoke the llama.cpp server or the CLI.
34
 
35
- ### CLI:
36
- ```bash
37
- llama-cli --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -p "The meaning to life and the universe is"
38
- ```
39
 
40
- ### Server:
41
- ```bash
42
- llama-server --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -c 2048
43
- ```
44
 
45
- Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo as well.
46
 
47
- Step 1: Clone llama.cpp from GitHub.
48
- ```
49
- git clone https://github.com/ggerganov/llama.cpp
50
- ```
51
 
52
- Step 2: Move into the llama.cpp folder and build it with `LLAMA_CURL=1` flag along with other hardware-specific flags (for ex: LLAMA_CUDA=1 for Nvidia GPUs on Linux).
53
  ```
54
- cd llama.cpp && LLAMA_CURL=1 make
55
  ```
56
 
57
- Step 3: Run inference through the main binary.
58
  ```
59
- ./llama-cli --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -p "The meaning to life and the universe is"
60
  ```
61
- or
62
  ```
63
- ./llama-server --hf-repo loaiabdalslam/Bee1reason-arabic-Qwen-14B-Q4_K_M-GGUF --hf-file bee1reason-arabic-qwen-14b-q4_k_m.gguf -c 2048
64
  ```
12
  - instruction-following
13
  - text-generation
14
  - merged_16bit
15
+ base_model: unsloth/Qwen3-14B
16
  datasets:
17
  - beetlware/arabic-reasoning-dataset-logic
18
  ---
19
 
20
+ # Bee1reason-arabic-Qwen-14B: A Qwen3 14B Model Fine-tuned for Arabic Logical Reasoning
21
 
22
+ ## Model Overview
 
23
 
24
+ **Bee1reason-arabic-Qwen-14B** is a Large Language Model (LLM) fine-tuned from the `unsloth/Qwen3-14B` base model. It has been specifically tailored to strengthen logical and deductive reasoning in Arabic while maintaining its general conversational abilities. Fine-tuning used LoRA (Low-Rank Adaptation) with the [Unsloth](https://github.com/unslothai/unsloth) library for high training efficiency, and the LoRA weights were then merged with the base model to produce this standalone 16-bit (float16) precision model.
 
25
 
26
+ **Key Features:**
27
+ * **Built on `unsloth/Qwen3-14B`:** Leverages the power and performance of the Qwen3 14-billion parameter base model.
28
+ * **Fine-tuned for Arabic Logical Reasoning:** Trained on a dataset containing Arabic logical reasoning tasks.
29
+ * **Conversational Format:** The model follows a conversational format, expecting user and assistant roles. It was trained on data that may include "thinking steps" (often within `<think>...</think>` tags) before providing the final answer, which is beneficial for tasks requiring explanation or complex inference.
30
+ * **Unsloth Efficiency:** The Unsloth library was used for the fine-tuning process, enabling faster training and reduced GPU memory consumption.
31
+ * **Merged 16-bit Model:** The final weights are a full float16 precision model, ready for direct use without needing to apply LoRA adapters to a separate base model.
32
 
33
+ ## Training Data
34
 
35
+ The model was primarily fine-tuned on a custom Arabic logical reasoning dataset, `beetlware/arabic-reasoning-dataset-logic`, available on the Hugging Face Hub. The dataset covers several types of reasoning (deduction, induction, and abduction); each task comprises the question text, a proposed answer, and a detailed solution including thinking steps.
36
 
37
+ This data was converted into a conversational format for training (see the sketch below), typically with:
38
+ 1. **User Role:** Containing the problem/question text.
39
+ 2. **Assistant Role:** Containing the detailed solution, including thinking steps (often within `<think>...</think>` tags) followed by the final answer.
40
 
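+ As a rough illustration, here is a minimal sketch of that conversion using the standard `datasets` API. The field names (`question`, `solution`, `answer`) are hypothetical and may differ from the actual column names in the dataset:
+
+ ```python
+ # Minimal sketch of the conversion described above (field names are assumptions).
+ from datasets import load_dataset
+
+ def to_conversation(example):
+     # Wrap the detailed solution in <think>...</think>, then append the final answer.
+     assistant_reply = f"<think>{example['solution']}</think>\n{example['answer']}"
+     return {
+         "conversations": [
+             {"role": "user", "content": example["question"]},
+             {"role": "assistant", "content": assistant_reply},
+         ]
+     }
+
+ dataset = load_dataset("beetlware/arabic-reasoning-dataset-logic", split="train")
+ dataset = dataset.map(to_conversation)
+ ```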
41
+ ## Fine-tuning Details
42
+
43
+ * **Base Model:** `unsloth/Qwen3-14B`
44
+ * **Fine-tuning Technique:** LoRA (Low-Rank Adaptation)
45
+ * `r` (rank): 32
46
+ * `lora_alpha`: 32
47
+ * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
48
+ * `lora_dropout`: 0
49
+ * `bias`: "none"
50
+ * **Libraries Used:** Unsloth (for efficient model loading and PEFT application) and Hugging Face TRL (`SFTTrainer`)
51
+ * **Max Sequence Length (`max_seq_length`):** 2048 tokens
52
+ * **Training Parameters (example from notebook):**
53
+ * `per_device_train_batch_size`: 2
54
+ * `gradient_accumulation_steps`: 4 (simulating a total batch size of 8)
55
+ * `warmup_steps`: 5
56
+ * `max_steps`: 30 (in the notebook, adjustable for a full run)
57
+ * `learning_rate`: 2e-4 (recommended to reduce to 2e-5 for longer training runs)
58
+ * `optim`: "adamw_8bit"
59
+ * **Final Save:** LoRA weights were merged with the base model and saved in `merged_16bit` (float16) precision; a configuration sketch reflecting these settings is shown below.
60
+
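+ The following is a minimal, non-authoritative sketch of what a LoRA setup with these hyperparameters might look like using Unsloth and TRL. It is not the authors' exact training script; dataset preparation is simplified, and `dataset` is assumed to be the conversational dataset from the previous section, already rendered to a `"text"` column with the chat template.
+
+ ```python
+ # Sketch only: mirrors the hyperparameters listed above (Unsloth + TRL).
+ from unsloth import FastLanguageModel
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Qwen3-14B",
+     max_seq_length=2048,
+     load_in_4bit=True,  # assumption: 4-bit loading during training to save memory
+ )
+
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=32,
+     lora_alpha=32,
+     lora_dropout=0,
+     bias="none",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     train_dataset=dataset,          # formatted conversational dataset (see previous section)
+     dataset_text_field="text",
+     max_seq_length=2048,
+     args=TrainingArguments(
+         per_device_train_batch_size=2,
+         gradient_accumulation_steps=4,
+         warmup_steps=5,
+         max_steps=30,
+         learning_rate=2e-4,
+         optim="adamw_8bit",
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+
+ # Merge the LoRA weights into the base model and save in float16.
+ model.save_pretrained_merged("Bee1reason-arabic-Qwen-14B", tokenizer, save_method="merged_16bit")
+ ```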
61
+ ## How to Use (with Transformers)
62
+
63
+ Since this is a merged 16-bit model, you can load and use it directly with the `transformers` library:
64
+
65
+ ```python
66
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
67
+ import torch
68
+
69
+ model_id = "beetlware/Bee1reason-arabic-Qwen-14B"
70
+
71
+ # Load the Tokenizer
72
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
73
 
74
+ # Load the Model
75
+ model = AutoModelForCausalLM.from_pretrained(
76
+     model_id,
77
+     torch_dtype=torch.bfloat16,  # or torch.float16 if bfloat16 is not supported
78
+     device_map="auto",  # distributes the model across available devices (GPU/CPU)
79
+ )
80
+
81
+ # Ensure the model is in evaluation mode for inference
82
+ model.eval()
83
  ```
84
+ ### Example: Inference with Thinking Steps
+ Qwen3 models use special `<think>...</think>` tags for their reasoning steps. To encourage thinking mode during inference, craft the prompt so that it asks the model to reason step by step; Unsloth-trained Qwen3 models also often respond to the `enable_thinking` argument of `tokenizer.apply_chat_template`. For this merged model, whether `<think>` blocks actually appear depends on the training data.
90
+ ```python
91
+ user_prompt_with_thinking_request = "استخدم التفكير المنطقي خطوة بخطوة: إذا كان لدي 4 تفاحات والشجرة فيها 20 تفاحة، فكم تفاحة لدي إجمالاً؟" # "Use step-by-step logical thinking: If I have 4 apples and the tree has 20 apples, how many apples do I have in total?"
92
+
93
+ messages_with_thinking = [
94
+     {"role": "user", "content": user_prompt_with_thinking_request}
95
+ ]
96
+
97
+ # Apply the chat template
98
+ # Qwen3 uses a specific chat template. tokenizer.apply_chat_template is the correct way to format it.
99
+ chat_prompt_with_thinking = tokenizer.apply_chat_template(
100
+     messages_with_thinking,
101
+     tokenize=False,
102
+     add_generation_prompt=True  # important for adding the assistant's generation prompt
103
+ )
104
+
105
+ inputs_with_thinking = tokenizer(chat_prompt_with_thinking, return_tensors="pt").to(model.device)
106
+
107
+ print("\n--- Inference with Thinking Request (Example) ---")
108
+ streamer_think = TextStreamer(tokenizer, skip_prompt=True)
109
+ with torch.no_grad():  # important: disable gradient tracking during inference
110
+     outputs_think = model.generate(
111
+         **inputs_with_thinking,
112
+         max_new_tokens=512,
113
+         temperature=0.6,  # settings recommended by the Qwen team for reasoning
114
+         top_p=0.95,
115
+         top_k=20,
116
+         pad_token_id=tokenizer.eos_token_id,
117
+         streamer=streamer_think,
118
+     )
119
  ```
120
 
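+ If you want to separate the model's reasoning trace from its final answer, a minimal post-processing sketch is shown below. It assumes the reasoning, when present, is wrapped in `<think>...</think>` tags as described above; the helper name `split_think` is purely illustrative.
+
+ ```python
+ import re
+
+ def split_think(response_text: str):
+     """Split a decoded response into (thinking_steps, final_answer).
+
+     Assumes any reasoning is wrapped in <think>...</think> tags."""
+     match = re.search(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
+     if match is None:
+         return "", response_text.strip()
+     thinking = match.group(1).strip()
+     answer = response_text[match.end():].strip()
+     return thinking, answer
+
+ # Example (response_text is the decoded model output from the generation above):
+ # thinking, answer = split_think(response_text)
+ ```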
121
+ ```python
122
+ # --- Example for Normal Inference (Conversation without explicit thinking request) ---
123
+ user_prompt_normal = "ما هي عاصمة مصر؟" # "What is the capital of Egypt?"
124
+ messages_normal = [
125
+     {"role": "user", "content": user_prompt_normal}
126
+ ]
127
+
128
+ chat_prompt_normal = tokenizer.apply_chat_template(
129
+     messages_normal,
130
+     tokenize=False,
131
+     add_generation_prompt=True
132
+ )
133
+ inputs_normal = tokenizer(chat_prompt_normal, return_tensors="pt").to(model.device)
134
+
135
+ print("\n\n--- Normal Inference (Example) ---")
136
+ streamer_normal = TextStreamer(tokenizer, skip_prompt=True)
137
+ with torch.no_grad():
138
+     outputs_normal = model.generate(
139
+         **inputs_normal,
140
+         max_new_tokens=100,
141
+         temperature=0.7,  # settings recommended for normal chat
142
+         top_p=0.8,
143
+         top_k=20,
144
+         pad_token_id=tokenizer.eos_token_id,
145
+         streamer=streamer_normal,
146
+     )
147
  ```
148
+
149
+
150
+ ## Usage with VLLM (for High-Throughput Scaled Inference)
151
+ VLLM is a library for fast, high-throughput LLM inference. Because this model is saved as a merged 16-bit checkpoint, it can be served with VLLM directly.
152
+
153
+ 1. Install VLLM:
154
+
155
+ ```bash
156
+
157
+ pip install vllm
158
  ```
159
+ (VLLM installation might have specific CUDA and PyTorch version requirements. Refer to the VLLM documentation for the latest installation prerequisites.)
160
+
161
+ 2. Run the VLLM OpenAI-Compatible Server:
162
+ You can serve the model using VLLM's OpenAI-compatible API server, making it easy to integrate into existing applications.
163
+
164
+ ```bash
165
+ python -m vllm.entrypoints.openai.api_server \
166
+     --model beetlware/Bee1reason-arabic-Qwen-14B \
167
+     --tokenizer beetlware/Bee1reason-arabic-Qwen-14B \
168
+     --dtype bfloat16 \
169
+     --max-model-len 2048
170
+     # Optional: --tensor-parallel-size N    (if you have multiple GPUs)
171
+     # Optional: --gpu-memory-utilization 0.9    (to adjust GPU memory usage)
173
  ```
174
+ - Replace `--dtype bfloat16` with `--dtype float16` if bfloat16 is not supported on your hardware.
175
+ - `--max-model-len` should match the `max_seq_length` used during fine-tuning (2048).
176
+
177
+ 3. Send Requests to the VLLM Server:
178
+ Once the server is running (typically on http://localhost:8000), you can send requests from any OpenAI-compatible client, such as the `openai` library:
179
+ ```python
180
+
181
+ import openai
182
+
183
+ client = openai.OpenAI(
184
+     base_url="http://localhost:8000/v1",  # VLLM server address
185
+     api_key="dummy_key",  # VLLM does not require a real API key by default
186
+ )
187
+
188
+ completion = client.chat.completions.create(
189
+     model="beetlware/Bee1reason-arabic-Qwen-14B",  # model name as served by VLLM
190
+     messages=[
191
+         {"role": "user", "content": "اشرح نظرية النسبية العامة بكلمات بسيطة."}  # "Explain the theory of general relativity in simple terms."
192
+     ],
193
+     max_tokens=256,
194
+     temperature=0.7,
195
+     stream=True  # enable streaming
196
+ )
197
+
198
+ print("Streaming response from VLLM:")
199
+ full_response = ""
200
+ for chunk in completion:
201
+     if chunk.choices[0].delta.content is not None:
202
+         token = chunk.choices[0].delta.content
203
+         print(token, end="", flush=True)
204
+         full_response += token
205
+ print("\n--- End of stream ---")
206
+
207
  ```
208
+
209
+
210
+ ## Limitations and Potential Biases
211
+ - The model's performance is highly dependent on the quality and diversity of the training data, and it may exhibit biases present in that data.
212
+ - Despite fine-tuning for logical reasoning, the model can still make errors on very complex or unfamiliar reasoning tasks.
213
+ - The model may "hallucinate" or produce incorrect information, especially for topics not well covered in its training data.
214
+ - Capabilities in languages other than Arabic may be limited, since training focused primarily on Arabic.
215
+
216
+
217
+ ## Additional Information
218
+ - Developed by: loai abdalslam (Beetleware)
219
+ - Upload/Release Date: 21-5-2025
220
+ - Contact / Issue Reporting: [email protected]
221
+
222
+ ## Beetleware
223
+
224
+
225
+ We are a software house and digital transformation service provider that was founded six years ago and is based in Saudi Arabia.
226
+
227
+ All rights reserved © 2025
228
+
229
+ Our Offices
+
+ - KSA Office: (+966) 54 597 3282
+ - Egypt Office: (+2) 010 67 256 306
+ - Oman Office: (+968) 9522 8632
241
+
242
+
243
+
244
+
245
+ ## Uploaded model
246
+
247
+ - **Developed by:** beetlware AI Team
248
+ - **License:** apache-2.0
249
+ - **Finetuned from model:** unsloth/qwen3-14b-unsloth-bnb-4bit
250
+
251
+ This Qwen3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
252
+
253
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)