Qwen3-8B-MultiStage-Finetune-Hybrid

Model Description

This is a fine-tuned version of the Qwen/Qwen3-8B large language model, specialized through a multi-stage training pipeline focused on medical reasoning, mathematical problem-solving, and general conversational ability. The model was trained with QLoRA (Quantized Low-Rank Adaptation) for memory-efficient fine-tuning and GRPO (Group Relative Policy Optimization) for reinforcement-learning-based improvement in its specialized domains.

The training methodology uses a progressive approach, building capabilities in distinct areas before consolidating them:

  1. Medical Reasoning SFT: Initial fine-tuning on a specialized medical dataset to adapt the model to medical explanations and reasoning.
  2. Mathematical SFT: Further fine-tuning on a mathematical dataset to enhance its ability to solve math problems.
  3. Mathematical GRPO: A reinforcement learning stage that uses a reward function to optimize the model's accuracy and its ability to produce structured mathematical solutions, particularly final answers in \boxed{} format (a sketch of such a reward function follows this list).
  4. General Chat SFT: Final fine-tuning on a diverse chat dataset to improve conversational fluency, helpfulness, and alignment with common dialogue patterns.
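The reward used in the GRPO stage is not reproduced here; below is a minimal sketch of a \boxed{}-based correctness reward of the kind trl's GRPOTrainer accepts. The function name, the assumption that completions arrive as plain strings, and the "answer" column name are illustrative, not the exact training setup.

import re

# Hypothetical reward: 1.0 if the completion's final \boxed{} answer matches the
# reference answer from the dataset, 0.0 otherwise. Extra dataset columns (here
# "answer") are forwarded to reward functions as keyword arguments by GRPOTrainer.
def boxed_correctness_reward(completions, answer, **kwargs):
    rewards = []
    for completion, reference in zip(completions, answer):
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        predicted = matches[-1].strip() if matches else None
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards

A function like this would typically be passed to trl's GRPOTrainer via its reward_funcs argument, with vllm generating the candidate completions during training.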

Training Details

Training Data

The model was trained on a carefully selected set of public datasets (a loading sketch follows the list):

  • Medical Reasoning: FreedomIntelligence/medical-o1-reasoning-SFT
  • Mathematical Reasoning: unsloth/OpenMathReasoning-mini
  • General Conversation: mlabonne/guanaco-llama2-1k
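All three datasets are hosted on the Hugging Face Hub and can be loaded with the datasets library. A minimal sketch (the configuration and split names below are assumptions; check each dataset card for the exact values):

from datasets import load_dataset

# Medical reasoning SFT data (the "en" configuration name is an assumption)
medical_ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")

# Mathematical reasoning data used for SFT and GRPO (split name assumed)
math_ds = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")

# General chat SFT data
chat_ds = load_dataset("mlabonne/guanaco-llama2-1k", split="train")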

Training Procedure

The model was fine-tuned using a hybrid approach that combines the efficient training capabilities of the unsloth library with the advanced reinforcement learning features of trl:

  • Base Model: Qwen/Qwen3-8B
  • Quantization: 4-bit NormalFloat (NF4) with double quantization enabled (bnb_4bit_use_double_quant=True), allowing for efficient training on limited GPU memory.
  • LoRA Configuration: A rank of r=24, lora_alpha=32, and lora_dropout=0.05 was applied. Key attention and feed-forward projection layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) were targeted for adaptation (see the configuration sketch after this list).
  • Gradient Checkpointing: Enabled for memory efficiency, using Unsloth's optimized implementation (use_gradient_checkpointing="unsloth").
  • Dynamic Hyperparameters: Batch sizes and gradient accumulation steps were adaptively adjusted per training stage to optimize GPU memory utilization and training throughput.
  • Learning Rate Schedule: A cosine decay schedule with a warmup ratio was used. Learning rates were customized for each training stage.
  • Optimizer: adamw_8bit.
  • Regularization: Gradient norm clipping (max_grad_norm=1.0) and weight decay (0.01) were applied to prevent exploding gradients and overfitting.
  • Early Stopping: Applied during SFT stages with a patience of 2 on validation loss, stopping training if no significant improvement was observed.
  • Hardware: Training was performed on a single GPU.
  • Software Stack: Python, Hugging Face transformers, unsloth, trl, datasets, wandb (for experiment tracking), and vllm (used during the GRPO stage for efficient text generation).
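The training scripts themselves are not included in this card. The sketch below shows how one SFT stage could be assembled with unsloth and trl under the hyperparameters listed above; the stage-specific learning rate, batch sizes, warmup ratio, dataset variables, and evaluation cadence are assumptions, and argument names may differ slightly across trl versions.

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from transformers import EarlyStoppingCallback

# Load the 4-bit base model (Unsloth applies NF4 with double quantization by default)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with the configuration described above
model = FastLanguageModel.get_peft_model(
    model,
    r=24,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
)

# One SFT stage; batch size, accumulation, and learning rate varied per stage
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,   # formatted dataset for the current stage (assumed variable)
    eval_dataset=eval_ds,     # held-out split monitored for early stopping (assumed variable)
    args=SFTConfig(
        per_device_train_batch_size=2,   # assumed; adjusted per stage
        gradient_accumulation_steps=8,   # assumed; adjusted per stage
        learning_rate=2e-4,              # assumed; customized per stage
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,               # assumed warmup ratio
        optim="adamw_8bit",
        max_grad_norm=1.0,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        output_dir="outputs",
        report_to="wandb",
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()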

Usage

This model is designed for text generation, particularly in response to chat-based prompts or specific medical and mathematical queries. For best results, format prompts with the tokenizer's chat template (tokenizer.apply_chat_template), as in the examples below.

Load the Model

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from unsloth import FastLanguageModel

# Configuration parameters (matching training)
MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = True
USE_DOUBLE_QUANT = True 

# BitsAndBytesConfig matching the quantization used during training (shown for reference;
# Unsloth applies equivalent NF4 settings when load_in_4bit=True)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=LOAD_IN_4BIT,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    bnb_4bit_use_double_quant=USE_DOUBLE_QUANT,
)

# Replace with the actual path to your uploaded model on Hugging Face Hub
model_id = "your-huggingface-username/Qwen3-8B-MultiStage-Finetune-Hybrid" 

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token # Ensure pad token is set for generation

# Load the model using Unsloth's optimized loading.
# Note: FastLanguageModel.from_pretrained returns a (model, tokenizer) tuple, and
# load_in_4bit=True applies the same NF4 double-quantization settings documented above.
model, _ = FastLanguageModel.from_pretrained(
    model_name=model_id,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=LOAD_IN_4BIT,
    device_map="auto",  # Automatically maps the model to available GPUs
)
FastLanguageModel.for_inference(model)  # Enable Unsloth's faster inference path

# Example for a general chat interaction
messages = [
    {"role": "system", "content": "You are a friendly and helpful assistant."},
    {"role": "user", "content": "Tell me a short, funny story about a clumsy robot."},
]

# Apply the chat template and tokenize inputs
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True, # Important: Add the prompt for the assistant's turn
    return_tensors="pt"
).to("cuda") # Move inputs to GPU

# Generate outputs
outputs = model.generate(
    input_ids,
    max_new_tokens=512, # Maximum tokens to generate
    do_sample=True,     # Enable sampling for more diverse outputs
    temperature=0.7,    # Control randomness
    top_p=0.95          # Nucleus sampling
)

# Decode and print the generated text, skipping special tokens
print("--- General Chat Example ---")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Example for a math problem (model is trained to provide a \boxed{} answer)
math_messages = [
    {"role": "system", "content": "You are a math solver. Provide your reasoning within \\ and the final answer in \\boxed{} format."},
    {"role": "user", "content": "If a car travels at 80 km/h for 2.5 hours, and then at 60 km/h for another 1.5 hours, what is the total distance traveled?"},
]

# Apply math chat template and tokenize inputs
math_input_ids = tokenizer.apply_chat_template(
    math_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

# Generate outputs for the math problem
math_outputs = model.generate(
    math_input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6, # Slightly lower temperature for more deterministic math outputs
    top_p=0.9
)

print("\n--- Math Example ---")
print(tokenizer.decode(math_outputs[0], skip_special_tokens=True))
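To pull the final answer out of the generated math solution programmatically, a small helper like the following can be used (a sketch; the regex assumes a simple, non-nested \boxed{...} expression and the function name is illustrative):

import re

def extract_boxed_answer(text: str):
    """Return the contents of the last \\boxed{...} in the text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

math_text = tokenizer.decode(math_outputs[0], skip_special_tokens=True)
print("Extracted answer:", extract_boxed_answer(math_text))  # the correct distance for the example prompt is 290 km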


Limitations and Bias

As a large language model, this fine-tuned Qwen3-8B model inherits general limitations and potential biases from its pre-training and fine-tuning data:

  • Hallucinations: The model may generate information that is factually incorrect or nonsensical. Always cross-reference critical information.
  • Factual Accuracy: Although specialized in medical and mathematical domains, the model should not be used as a substitute for professional medical advice, formal mathematical proofs, or any task requiring precision without independent verification.
  • Bias: Outputs are influenced by biases present in the training data (both the base model's pre-training corpus and the fine-tuning datasets) and may occasionally be stereotypical, harmful, or unfair.
  • Language Proficiency: The model was primarily trained on English text. Some Spanish content was present in the general chat dataset, but proficiency in Spanish or other languages is not guaranteed.
  • Context Window: Limited to a max_seq_length of 2048 tokens. Very long inputs or extensive multi-turn conversations may lead to degraded performance or truncation of context.

Ethical Considerations

Users should be aware of the following ethical considerations when deploying or using this model:

  • Not for Critical Applications: This model is intended for research, experimentation, and exploratory applications. It is not designed or validated for use in critical systems where accuracy, reliability, and safety are paramount (e.g., medical diagnosis, financial advice, legal counsel, or decision-making systems affecting individuals).
  • Responsible AI Use: Deploy and use this model responsibly, adhering to ethical AI guidelines. Implement safeguards to monitor its outputs and prevent misuse, discrimination, or the generation of harmful content.
  • Data Privacy and Security: Do not use this model with sensitive personally identifiable information (PII) or confidential data. Ensure compliance with all relevant data privacy regulations.
  • Transparency: Be transparent with end users when they are interacting with an AI system.

Citation

If you use this model or the training methodology, please consider citing the following key components:

@misc{qwen3,
  author = {Qwen Team},
  title = {Qwen3-8B},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3-8B}}
}

@misc{unsloth,
  author = {Daniel Han},
  title = {Unsloth: Fast LLM Fine-tuning},
  year = {2023},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/unslothai/unsloth}}
}

@misc{trl,
  author = {Hugging Face Team},
  title = {TRL: Transformer Reinforcement Learning},
  year = {2023},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

@misc{medical_dataset,
  author = {FreedomIntelligence},
  title = {medical-o1-reasoning-SFT},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FreedomIntelligence/medical-o1-reasoning-SFT}}
}

@misc{openmathreasoning_mini,
  author = {unsloth},
  title = {OpenMathReasoning-mini},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini}}
}

@misc{guanaco_llama2_1k,
  author = {mlabonne},
  title = {guanaco-llama2-1k},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k}}
}