Gemma 3 1B - SEO Reasoning (GRPO Iteration 1) - cyberandy/gemma3-1b-feliSEO-2run
Model Description
This model is an experimental fine-tune of the unsloth/gemma-3-1b-it model, developed as part of research within the WordLift Lab. It was trained using Group Relative Policy Optimization (GRPO) with the trl library and Unsloth optimizations.
The primary goal of this fine-tuning iteration was to teach the model to perform basic SEO-related reasoning tasks, guided by concepts from the SEOntology project (https://github.com/seontology/), and structure its output using specific XML-like tags: <reasoning> and <answer>.
This is an early iteration based on a very small dataset (100 examples) and limited training steps. It demonstrates partial success in format adherence but requires significantly more data and training for robust content generation.
- Developed by: cyberandy (WordLift Lab)
- Base Model: unsloth/gemma-3-1b-it
- Fine-tuning Method: Group Relative Policy Optimization (GRPO)
- Language(s): English
- License: Gemma Terms of Use (inherited from the base model; confirm license compatibility for your use case)
Training Data
The model was fine-tuned on a custom dataset consisting of 100 examples. This dataset was created through a two-step process:
- Synthetic Generation: Initial SEO task prompts (covering meta descriptions, schema suggestions, keyword analysis, etc., inspired by real-world SEO challenges) were used with the Gemini 1.5 Pro API to generate initial responses containing <reasoning> and <answer> sections.
- LLM-as-a-Judge Evaluation: The generated synthetic examples were then evaluated by Gemini 1.5 Pro (acting as a judge based on SEO best practices and concepts derived from the SEOntology's seovoc vocabulary, https://w3id.org/seovoc/) to assign a reward score (0.0 to 1.0) to each example.
- Final Format: The data was processed into {'prompt': str, 'reward': float} format, where prompt contained the system and user turns formatted using the Gemma 3 chat template.
Due to the small dataset size, the diversity of SEO tasks covered is limited.
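For illustration only, here is a minimal sketch of what a single {'prompt', 'reward'} record might look like; the user query and the reward value below are hypothetical placeholders, not items from the actual dataset:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3-1b-it")

# System prompt requesting the <reasoning>/<answer> structure
# (the full wording used at inference time is shown in "How to Use" below).
system_prompt = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>"
)

# Hypothetical task; the real 100 prompts covered meta descriptions,
# schema suggestions, keyword analysis, and similar SEO tasks.
user_query = "Write a meta description for a page about vegan trail-running shoes."

record = {
    # System and user turns rendered with the Gemma 3 chat template.
    "prompt": tokenizer.apply_chat_template(
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
        tokenize=False,
        add_generation_prompt=True,
    ),
    # 0.0-1.0 score assigned by the Gemini 1.5 Pro judge (placeholder value).
    "reward": 0.85,
}
```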
Training Procedure
- Framework: unsloth, trl, peft, transformers, torch
- Algorithm: GRPO (trl.GRPOTrainer)
- Model: unsloth/gemma-3-1b-it loaded without 4-bit quantization (float32/bfloat16) using FastLanguageModel.
- PEFT: LoRA adapters applied with r=8, lora_alpha=8.
- Reward Functions: Custom Python functions were used during training to provide rewards based on the model's generated completions (a sketch follows this list):
  - soft_format_reward_func_partial: Rewarded adherence to the <reasoning>/<answer> structure, with partial credit for individual tags.
  - ontology_keyword_reward_func_partial: Gave small, diminishing rewards for finding relevant SEO/SEOntology keywords (e.g., 'seovoc:', 'schema.org', 'ctr', 'entity') within the <reasoning> section.
  - Combined scores were rescaled using tanh into approximately [-1, 1].
- Key Hyperparameters (GRPOConfig; see the setup sketch after this list):
  - learning_rate: 5e-6
  - per_device_train_batch_size: 2
  - gradient_accumulation_steps: 2 (effective batch size: 4)
  - gradient_checkpointing: True (using Unsloth's implementation)
  - num_generations: 2
  - max_prompt_length: 512
  - max_completion_length: 512
  - max_steps: 500 (intended; the actual run may have varied)
  - optim: adamw_8bit
  - lr_scheduler_type: cosine
  - warmup_ratio: 0.1
- Hardware: NVIDIA A100 (40GB VRAM)
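The exact reward implementations are not included in this card; the following is a minimal sketch of how the two reward functions and the tanh rescaling described above could look, with the scoring weights, the centering, and the trl reward-function signature treated as assumptions:

```python
import math
import re

# Sketch of the format reward: full credit for a complete <reasoning>/<answer>
# structure, partial credit for individual tags (exact weights are assumptions).
def soft_format_reward_func_partial(completions, **kwargs):
    rewards = []
    for completion in completions:
        # trl passes either plain strings or chat-style message lists.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        score = 0.0
        if re.search(r"<reasoning>.*?</reasoning>", text, re.DOTALL):
            score += 0.5
        else:
            score += 0.125 * sum(tag in text for tag in ("<reasoning>", "</reasoning>"))
        if re.search(r"<answer>.*?</answer>", text, re.DOTALL):
            score += 0.5
        else:
            score += 0.125 * sum(tag in text for tag in ("<answer>", "</answer>"))
        rewards.append(score)
    return rewards


# Sketch of the keyword reward: small, diminishing credit for SEO/SEOntology
# terms found inside the <reasoning> block.
SEO_KEYWORDS = ("seovoc:", "schema.org", "ctr", "entity")

def ontology_keyword_reward_func_partial(completions, **kwargs):
    rewards = []
    for completion in completions:
        text = completion[0]["content"] if isinstance(completion, list) else completion
        match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
        reasoning = match.group(1).lower() if match else ""
        hits = sum(1 for kw in SEO_KEYWORDS if kw in reasoning)
        # Each additional keyword is worth half the previous one.
        rewards.append(sum(0.1 * (0.5 ** i) for i in range(hits)))
    return rewards


# Combined scores were rescaled with tanh into roughly [-1, 1]; the centering
# below is an assumption about how that rescaling was done.
def combine_rewards(format_rewards, keyword_rewards):
    return [math.tanh(f + k - 0.5) for f, k in zip(format_rewards, keyword_rewards)]
```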
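Below is a condensed sketch of the training setup reconstructed from the configuration above. Arguments not listed in this card (e.g., LoRA target modules, output_dir) fall back to library defaults or placeholder values, and the dataset and reward functions are the ones described and sketched earlier:

```python
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model without 4-bit quantization, as described above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it",
    max_seq_length=1024,  # max_prompt_length (512) + max_completion_length (512)
    load_in_4bit=False,
)

# Attach LoRA adapters with the r/alpha values from this run; gradient
# checkpointing uses Unsloth's implementation.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=8,
    use_gradient_checkpointing="unsloth",
)

# Placeholder stand-in for the 100-example {'prompt': str, 'reward': float} dataset.
train_dataset = Dataset.from_list(
    [{"prompt": "<Gemma 3 chat-formatted prompt>", "reward": 0.85}]
)

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,  # effective batch size: 4
    num_generations=2,
    max_prompt_length=512,
    max_completion_length=512,
    max_steps=500,
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        soft_format_reward_func_partial,       # format adherence (sketched above)
        ontology_keyword_reward_func_partial,  # SEO keyword coverage (sketched above)
    ],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```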
Evaluation Results
- Quantitative: No formal evaluation metrics were calculated for this iteration.
- Training Metrics: Positive average rewards (~0.4-0.6+) were observed, indicating the reward functions provided a learning signal. Training completed the intended steps. W&B Run: [Link to your W&B run if public]
- Qualitative: Inference shows the model adheres well to the requested <reasoning>/<answer> format. Reasoning content often shows logical steps. Answer content is sometimes relevant but can suffer from hallucinations or incoherence, especially near the end of generation. Requires further training on more diverse data and potentially reward function refinement.
Intended Use
- Primary Use: Research and experimentation by WordLift Lab on applying GRPO for teaching structured SEO reasoning guided by SEOntology concepts. Demonstrating the fine-tuning process.
- Out-of-Scope: NOT suitable for production use. Reliability and factual accuracy are not guaranteed.
Limitations and Bias
- Small, Synthetic Dataset: Trained on only 100 synthetic examples. Generalization is limited.
- Content Quality: Prone to hallucinations and incoherence. Factual accuracy not verified.
- Reward Function Simplicity: Rewards focused on format and basic keywords, not deep semantic correctness.
- Training Instability/Speed: Significant performance challenges were encountered during development, particularly with the 4B model. The successful 1B run required specific configurations impacting speed.
- Inherited Bias: Inherits biases from the base gemma-3-1b-it model and the Gemini 1.5 Pro model used for data generation and evaluation.
How to Use
Load the merged model (this repository contains the full 16-bit merged weights) using transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

model_id = "cyberandy/gemma3-1b-feliSEO-2run"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in bfloat16 or float16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,      # Or torch.float16
    device_map="auto",               # Use GPU if available
    attn_implementation="eager"      # Recommended for Gemma 3 stability
)
model.eval()

# --- Define System Prompt Used During Training/Expected by Model ---
system_prompt = """Respond in the following format:
<reasoning>
Explain your thinking step-by-step. Use relevant SEO concepts.
</reasoning>
<answer>
Provide the final answer to the question or the requested SEO element.
</answer>"""

# --- Prepare Input ---
user_query = "What schema.org type should be used for a local dental clinic's homepage?"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)

# --- Configure Generation ---
gen_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 1,  # Use EOS ID for pad
    eos_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 1,
)

# --- Generate ---
with torch.no_grad():
    outputs = model.generate(input_ids=inputs, generation_config=gen_config)

# --- Decode ---
generated_ids = outputs[0][inputs.shape[-1]:]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(f"Prompt: {user_query}")
print("-" * 20)
print(f"Generated Output:\n{generated_text}")
```
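To consume the structured output, you may want to split the completion back into its <reasoning> and <answer> parts. A minimal sketch (the helper below is illustrative and reuses generated_text from the snippet above):

```python
import re

def parse_tagged_output(text: str) -> dict:
    """Extract the <reasoning> and <answer> sections from a completion.

    Returns None for a section whose tags are missing or malformed, which
    this early-iteration model can still occasionally produce.
    """
    def extract(tag: str):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return match.group(1).strip() if match else None

    return {"reasoning": extract("reasoning"), "answer": extract("answer")}

parsed = parse_tagged_output(generated_text)
print(parsed["answer"])
```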