---
base_model: bleta-meditor-27b
tags:
- text-generation-inference
- transformers
- gemma3
- reasoning
- mathematics
- grpo
license: apache-2.0
language:
- sq
inference:
  parameters:
    temperature: 0.7
    top_p: 0.95
    top_k: 64
    max_new_tokens: 512
---
# Gemma 3 27B GRPO Reasoning Model

## Model Description
- Developed by: klei aliaj
- Model type: Gemma 3 27B fine-tuned with GRPO for reasoning tasks
- License: apache-2.0
- Finetuned from model: Google's Gemma 3 27B instruction-tuned model
- Framework: Hugging Face Transformers
This model is a fine-tuned version of Google's Gemma 3 27B instruction-tuned model, enhanced using Group Relative Policy Optimization (GRPO) to improve its reasoning capabilities.
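A minimal text-generation sketch is shown below. It assumes the repository id from the citation at the bottom of this card (`klei1/gemma-3-27b-grpo`) hosts weights loadable directly with `transformers`; the sampling parameters mirror the inference defaults in the card metadata.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "klei1/gemma-3-27b-grpo"  # assumed repo id (from the citation below)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 13 * 17? Show your working."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,  # matches the inference defaults above
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=64,
)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```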
## Capabilities & Training

### Fine-tuning Approach

This model was fine-tuned using GRPO (Group Relative Policy Optimization), a reinforcement learning technique that samples groups of completions, scores them with reward functions, and updates the model toward the higher-scoring ones (an illustrative reward sketch follows the list below). The model was trained to:
- Follow a specific reasoning format with dedicated sections for workings and solutions
- Produce correct mathematical solutions
- Show clear step-by-step reasoning processes
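The exact reward functions used in training are not published on this card. The sketch below is an illustrative reimplementation of two of the reward signals named above (format adherence and answer accuracy), assuming TRL-style reward functions that receive string completions plus any dataset columns (here a hypothetical `answer` column) as keyword arguments and return one float per completion.

```python
import re


def format_reward(completions, **kwargs):
    """Reward completions that contain the required tags in the required order."""
    pattern = r"<start_working_out>.*?<end_working_out>.*?<SOLUTION>.*?</SOLUTION>"
    return [1.0 if re.search(pattern, c, flags=re.DOTALL) else 0.0 for c in completions]


def accuracy_reward(completions, answer, **kwargs):
    """Reward completions whose <SOLUTION> block matches the reference answer."""
    rewards = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", completion, flags=re.DOTALL)
        extracted = match.group(1).strip() if match else None
        rewards.append(1.0 if extracted == str(reference).strip() else 0.0)
    return rewards
```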
### Special Formatting

The model has been trained to follow a specific reasoning format (a parsing helper is sketched after this list):
- Working-out/reasoning sections are enclosed within `<start_working_out>` and `<end_working_out>` tags
- Final solutions are provided between `<SOLUTION>` and `</SOLUTION>` tags
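A small helper like the one below (a sketch, not part of the released model) can split a response into its two sections:

```python
import re


def parse_response(text: str) -> tuple[str | None, str | None]:
    """Extract the working-out and solution sections from a model response."""
    working = re.search(r"<start_working_out>(.*?)<end_working_out>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return (
        working.group(1).strip() if working else None,
        solution.group(1).strip() if solution else None,
    )
```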
### Training Configuration

- Framework: Hugging Face's TRL library
- Optimization: LoRA fine-tuning (r=8, alpha=8)
- Reward Functions: format adherence, answer accuracy, and reasoning quality (see the sketch after this list)
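This card does not include the full training script. The sketch below shows how such a run might be wired up with TRL's `GRPOTrainer` and a `peft` `LoraConfig` matching the stated r=8, alpha=8, reusing the reward functions sketched earlier; the dataset choice and all other hyperparameters are illustrative assumptions, not the actual training setup.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Hypothetical dataset choice; the actual training data is not documented here.
# GRPOTrainer expects a "prompt" column, so the GSM8K "question" column is renamed.
train_dataset = load_dataset("openai/gsm8k", "main", split="train").rename_column(
    "question", "prompt"
)

peft_config = LoraConfig(r=8, lora_alpha=8, task_type="CAUSAL_LM")
training_args = GRPOConfig(output_dir="gemma-3-27b-grpo")

trainer = GRPOTrainer(
    model="google/gemma-3-27b-it",
    reward_funcs=[format_reward, accuracy_reward],  # defined in the sketch above
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```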
## Technical Specifications

### Available Formats

This model is available in two formats (a GGUF loading sketch follows this list):
- Standard adapter format (`adapter_model.safetensors`)
- GGUF 8-bit quantized format (`bleta-meditor-27b-finetune.Q8_0.gguf`) for use with llama.cpp
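For the GGUF file, a minimal `llama-cpp-python` sketch might look like this; the local path assumes the file has already been downloaded from the repository.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="bleta-meditor-27b-finetune.Q8_0.gguf",  # local path, download first
    n_ctx=8192,  # illustrative; Gemma 3 supports up to 128K context
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve for x: 2x + 6 = 20"}],
    temperature=0.7,
    top_p=0.95,
    top_k=64,
    max_tokens=512,
)
print(result["choices"][0]["message"]["content"])
```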
### Gemma 3 Architecture Benefits

- 27B parameters, trained on 14 trillion tokens
- 128K context window (extended from 32K)
- QK normalization (replacing attention soft-capping)
- Interleaved attention: 5 sliding-window layers per global attention layer
- 1024-token sliding window for the local attention layers
## Limitations
- While this model excels at reasoning tasks, particularly mathematical problems, it may still occasionally provide incorrect solutions for complex problems.
- The model's performance might vary depending on problem complexity and wording.
- Like all language models, it may occasionally hallucinate or provide incorrect information outside its training domain.
## Acknowledgments
- Google for developing the Gemma 3 model family
- Hugging Face for their TRL library and GRPO implementation
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{klei1_gemma3_grpo,
  author       = {klei1},
  title        = {Gemma 3 27B GRPO Reasoning Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/klei1/gemma-3-27b-grpo}}
}
```