---
license: apache-2.0 # Or appropriate license
language: en
library_name: transformers
pipeline_tag: text-generation
tags:
- gemma3
- unsloth
- grpo
- rlft
- seo
- reasoning
- instruction-tuning
- text-generation
- experimental
- wordlift
- seontology
---

# Gemma 3 1B - SEO Reasoning (GRPO Iteration 1) - `cyberandy/gemma3-1b-feliSEO-2run`

## Model Description

This model is an **experimental fine-tune** of the `unsloth/gemma-3-1b-it` model, developed as part of research within the **WordLift Lab**. It was trained using Group Relative Policy Optimization (GRPO) with the `trl` library and Unsloth optimizations.

The primary goal of this fine-tuning iteration was to teach the model to perform basic SEO-related reasoning tasks, guided by concepts from the **SEOntology project (https://github.com/seontology/)**, and to structure its output using specific XML-like tags: `<reasoning>` and `<answer>`.

**This is an early iteration based on a very small dataset (100 examples) and limited training steps. It demonstrates partial success in format adherence but requires significantly more data and training for robust content generation.**

* **Developed by:** cyberandy (WordLift Lab)
* **Base Model:** `unsloth/gemma-3-1b-it`
* **Fine-tuning Method:** Group Relative Policy Optimization (GRPO)
* **Language(s):** English
* **License:** Likely Apache 2.0 (inherited from the base model; confirm license compatibility)

## Training Data

The model was fine-tuned on a custom dataset of **100 examples**, created through a three-step process:

1. **Synthetic Generation:** Initial SEO task prompts (covering meta descriptions, schema suggestions, keyword analysis, etc., inspired by real-world SEO challenges) were used with the Gemini 1.5 Pro API to generate initial responses containing `<reasoning>` and `<answer>` sections.
2. **LLM-as-a-Judge Evaluation:** The generated synthetic examples were then evaluated by Gemini 1.5 Pro (acting as a judge based on SEO best practices and concepts derived from the **SEOntology's `seovoc` vocabulary (https://w3id.org/seovoc/)**) to assign a reward score (0.0 to 1.0) to each example.
3. **Final Format:** The data was processed into `{'prompt': str, 'reward': float}` records, where `prompt` contained the system and user turns formatted with the Gemma 3 chat template.

Due to the small dataset size, the diversity of SEO tasks covered is limited.

## Training Procedure

* **Framework:** `unsloth`, `trl`, `peft`, `transformers`, `torch`
* **Algorithm:** GRPO (`trl.GRPOTrainer`)
* **Model:** `unsloth/gemma-3-1b-it` loaded without 4-bit quantization (float32/bfloat16) using `FastLanguageModel`.
* **PEFT:** LoRA adapters applied with `r=8`, `lora_alpha=8`.
* **Reward Functions:** Custom Python functions provided rewards based on the model's generated completions (see the sketch after this list):
  * `soft_format_reward_func_partial`: Rewarded adherence to the `<reasoning>`/`<answer>` structure, with partial credit for individual tags.
  * `ontology_keyword_reward_func_partial`: Gave small, diminishing rewards for finding relevant SEO/SEOntology keywords (e.g., 'seovoc:', 'schema.org', 'ctr', 'entity') within the `<reasoning>` section.
  * Combined scores were rescaled using `tanh` into approximately `[-1, 1]`.
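The exact reward implementations are not included in this card; the following is a minimal, per-completion sketch of the logic described above. The function names mirror the descriptions, but the weights, the diminishing-bonus schedule, and the centering applied before `tanh` are assumptions. Note that in `trl`, reward functions receive batches of completions, so a real implementation would wrap logic like this in a loop.

```python
import math
import re

# Illustrative reconstruction -- weights and schedules are assumptions,
# not the exact functions used in training.
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
KEYWORDS = ["seovoc:", "schema.org", "ctr", "entity"]

def soft_format_reward(completion: str) -> float:
    """Partial credit per tag pair; full credit for the complete structure."""
    score = 0.0
    if REASONING_RE.search(completion):
        score += 0.5
    if ANSWER_RE.search(completion):
        score += 0.5
    return score

def ontology_keyword_reward(completion: str) -> float:
    """Small, diminishing rewards for SEO keywords inside <reasoning>."""
    match = REASONING_RE.search(completion)
    if not match:
        return 0.0
    reasoning = match.group(1).lower()
    score, bonus = 0.0, 0.1  # assumed initial bonus, halved per keyword found
    for kw in KEYWORDS:
        if kw in reasoning:
            score += bonus
            bonus *= 0.5
    return score

def combined_reward(completion: str) -> float:
    """Combine and rescale into roughly [-1, 1] via tanh."""
    raw = soft_format_reward(completion) + ontology_keyword_reward(completion)
    return math.tanh(2.0 * raw - 1.0)  # assumed centering so scores span ~[-1, 1]
```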
* **Key Hyperparameters (`GRPOConfig`):** (assembled in the configuration sketch after this list)
  * `learning_rate`: 5e-6
  * `per_device_train_batch_size`: 2
  * `gradient_accumulation_steps`: 2 (effective batch size: 4)
  * `gradient_checkpointing`: True (using Unsloth's implementation)
  * `num_generations`: 2
  * `max_prompt_length`: 512
  * `max_completion_length`: 512
  * `max_steps`: 500 (intended; the actual run may have varied)
  * `optim`: adamw_8bit
  * `lr_scheduler_type`: cosine
  * `warmup_ratio`: 0.1
* **Hardware:** NVIDIA A100 (40GB VRAM)
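The training script itself is not published here; the following is a hedged sketch of how these pieces could fit together with `unsloth` and `trl.GRPOTrainer`, using the hyperparameters listed above. `combined_reward` refers to the reward sketch earlier in this card; the `max_seq_length`, `output_dir`, and placeholder dataset are assumptions.

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Load the base model without 4-bit quantization, then attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it",
    max_seq_length=1024,  # assumed: room for max_prompt_length + max_completion_length
    load_in_4bit=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=8,
    use_gradient_checkpointing="unsloth",  # Unsloth's gradient-checkpointing implementation
)

# Placeholder dataset; the real records held pre-templated Gemma 3 prompts.
train_dataset = Dataset.from_list([{"prompt": "..."}])

def reward_batch(completions, **kwargs):
    # Batched wrapper around the per-completion reward sketched earlier.
    return [combined_reward(c) for c in completions]

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,  # effective batch size: 4
    num_generations=2,
    max_prompt_length=512,
    max_completion_length=512,
    max_steps=500,
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    output_dir="outputs",  # assumed
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[reward_batch],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```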
## Evaluation Results

* **Quantitative:** No formal evaluation metrics were calculated for this iteration.
* **Training Metrics:** Positive average rewards (~0.4-0.6+) were observed, indicating that the reward functions provided a learning signal. Training completed the intended steps. W&B Run: [Link to your W&B run if public]
* **Qualitative:** Inference shows the model **adheres well to the requested `<reasoning>`/`<answer>` format**. Reasoning content often shows logical steps. Answer content is sometimes relevant but can suffer from **hallucinations or incoherence**, especially near the end of generation. Further training on more diverse data, and potentially refined reward functions, is required.

## Intended Use

* **Primary Use:** Research and experimentation by WordLift Lab on applying GRPO to teach structured SEO reasoning guided by SEOntology concepts, and demonstrating the fine-tuning process.
* **Out-of-Scope:** **NOT suitable for production use.** Reliability and factual accuracy are not guaranteed.

## Limitations and Bias

* **Small, Synthetic Dataset:** Trained on only 100 synthetic examples; generalization is limited.
* **Content Quality:** Prone to hallucinations and incoherence; factual accuracy is not verified.
* **Reward Function Simplicity:** Rewards focused on format and basic keywords, not deep semantic correctness.
* **Training Instability/Speed:** Significant performance challenges were encountered during development, particularly with the 4B model. The successful 1B run required specific configurations that impacted speed.
* **Inherited Bias:** Inherits biases from the base `gemma-3-1b-it` model and from the Gemini 1.5 Pro model used for data generation and evaluation.

## How to Use

Load the merged model (this repository contains the full 16-bit merged weights) using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

model_id = "cyberandy/gemma3-1b-feliSEO-2run"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in bfloat16 or float16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # Or torch.float16
    device_map="auto",            # Use GPU if available
    attn_implementation="eager",  # Recommended for Gemma 3 stability
)
model.eval()

# --- Define the system prompt used during training / expected by the model ---
system_prompt = """Respond in the following format:
<reasoning>
Explain your thinking step-by-step. Use relevant SEO concepts.
</reasoning>
<answer>
Provide the final answer to the question or the requested SEO element.
</answer>"""

# --- Prepare input ---
user_query = "What schema.org type should be used for a local dental clinic's homepage?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

# --- Configure generation ---
gen_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 1,  # Use EOS ID for pad
    eos_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 1,
)

# --- Generate ---
with torch.no_grad():
    outputs = model.generate(input_ids=inputs, generation_config=gen_config)

# --- Decode only the newly generated tokens ---
generated_ids = outputs[0][inputs.shape[-1]:]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(f"Prompt: {user_query}")
print("-" * 20)
print(f"Generated Output:\n{generated_text}")
```
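Since the model is trained to emit `<reasoning>` and `<answer>` blocks, you will typically want to split the output into its two sections. A minimal sketch follows (the `extract_sections` helper is illustrative, not part of this repository); it returns `None` for any section whose tags are missing from the completion:

```python
import re

def extract_sections(text: str) -> dict:
    """Split a completion into its <reasoning> and <answer> parts, if present."""
    sections = {}
    for tag in ("reasoning", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections

parsed = extract_sections(generated_text)
print("Reasoning:", parsed["reasoning"])
print("Answer:", parsed["answer"])
```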