---
license: apache-2.0 # Or appropriate license
language: en
library_name: transformers
pipeline_tag: text-generation
tags:
- gemma3
- unsloth
- grpo
- rlft
- seo
- reasoning
- instruction-tuning
- text-generation
- experimental
- wordlift
- seontology
---

# Gemma 3 1B - SEO Reasoning (GRPO Iteration 1) - `cyberandy/gemma3-1b-feliSEO-2run`

## Model Description

This model is an **experimental fine-tune** of the `unsloth/gemma-3-1b-it` model, developed as part of research within the **WordLift Lab**. It was trained using Group Relative Policy Optimization (GRPO) with the `trl` library and Unsloth optimizations.

The primary goal of this fine-tuning iteration was to teach the model to perform basic SEO-related reasoning tasks, guided by concepts from the **SEOntology project (https://github.com/seontology/)**, and to structure its output using specific XML-like tags: `<reasoning>` and `<answer>`.

**This is an early iteration based on a very small dataset (100 examples) and limited training steps. It demonstrates partial success in format adherence but requires significantly more data and training for robust content generation.**

* **Developed by:** cyberandy (WordLift Lab)
* **Base Model:** `unsloth/gemma-3-1b-it`
* **Fine-tuning Method:** Group Relative Policy Optimization (GRPO)
* **Language(s):** English
* **License:** Likely Apache 2.0 (inherited from the base model; confirm license compatibility)

## Training Data

The model was fine-tuned on a custom dataset of **100 examples**, created through a three-step process:

1. **Synthetic Generation:** Initial SEO task prompts (covering meta descriptions, schema suggestions, keyword analysis, etc., inspired by real-world SEO challenges) were used with the Gemini 1.5 Pro API to generate initial responses containing `<reasoning>` and `<answer>` sections.
2. **LLM-as-a-Judge Evaluation:** The generated synthetic examples were then evaluated by Gemini 1.5 Pro (acting as a judge based on SEO best practices and concepts derived from the **SEOntology's `seovoc` vocabulary (https://w3id.org/seovoc/)**) to assign a reward score (0.0 to 1.0) to each example.
3. **Final Format:** The data was processed into `{'prompt': str, 'reward': float}` records, where `prompt` contained the system and user turns formatted with the Gemma 3 chat template.

Due to the small dataset size, the diversity of SEO tasks covered is limited.

## Training Procedure

* **Framework:** `unsloth`, `trl`, `peft`, `transformers`, `torch`
* **Algorithm:** GRPO (`trl.GRPOTrainer`)
* **Model:** `unsloth/gemma-3-1b-it` loaded without 4-bit quantization (float32/bfloat16) using `FastLanguageModel`.
* **PEFT:** LoRA adapters applied with `r=8`, `lora_alpha=8`.
* **Reward Functions:** Custom Python functions provided rewards based on the model's generated completions (see the sketch after this list):
  * `soft_format_reward_func_partial`: Rewarded adherence to the `<reasoning>`/`<answer>` structure, with partial credit for individual tags.
  * `ontology_keyword_reward_func_partial`: Gave small, diminishing rewards for finding relevant SEO/SEOntology keywords (e.g., 'seovoc:', 'schema.org', 'ctr', 'entity') within the `<reasoning>` section.
  * Combined scores were rescaled using `tanh` into approximately `[-1, 1]`.
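The exact reward implementations are not included in this card; the following is a minimal, per-completion sketch of the logic described above. The function names mirror the descriptions, but the weights, the diminishing-bonus schedule, and the centering applied before `tanh` are assumptions. Note that in `trl`, reward functions receive batches of completions, so a real implementation would wrap logic like this in a loop.

```python
import math
import re

# Illustrative reconstruction -- weights and schedules are assumptions,
# not the exact functions used in training.
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
KEYWORDS = ["seovoc:", "schema.org", "ctr", "entity"]

def soft_format_reward(completion: str) -> float:
    """Partial credit per tag pair; full credit for the complete structure."""
    score = 0.0
    if REASONING_RE.search(completion):
        score += 0.5
    if ANSWER_RE.search(completion):
        score += 0.5
    return score

def ontology_keyword_reward(completion: str) -> float:
    """Small, diminishing rewards for SEO keywords inside <reasoning>."""
    match = REASONING_RE.search(completion)
    if not match:
        return 0.0
    reasoning = match.group(1).lower()
    score, bonus = 0.0, 0.1  # assumed initial bonus, halved per keyword found
    for kw in KEYWORDS:
        if kw in reasoning:
            score += bonus
            bonus *= 0.5
    return score

def combined_reward(completion: str) -> float:
    """Combine and rescale into roughly [-1, 1] via tanh."""
    raw = soft_format_reward(completion) + ontology_keyword_reward(completion)
    return math.tanh(2.0 * raw - 1.0)  # assumed centering so scores span ~[-1, 1]
```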
* **Key Hyperparameters (`GRPOConfig`):** (assembled in the configuration sketch after this list)
  * `learning_rate`: 5e-6
  * `per_device_train_batch_size`: 2
  * `gradient_accumulation_steps`: 2 (effective batch size: 4)
  * `gradient_checkpointing`: True (using Unsloth's implementation)
  * `num_generations`: 2
  * `max_prompt_length`: 512
  * `max_completion_length`: 512
  * `max_steps`: 500 (intended; the actual run may have varied)
  * `optim`: adamw_8bit
  * `lr_scheduler_type`: cosine
  * `warmup_ratio`: 0.1
* **Hardware:** NVIDIA A100 (40GB VRAM)
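The training script itself is not published here; the following is a hedged sketch of how these pieces could fit together with `unsloth` and `trl.GRPOTrainer`, using the hyperparameters listed above. `combined_reward` refers to the reward sketch earlier in this card; the `max_seq_length`, `output_dir`, and placeholder dataset are assumptions.

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Load the base model without 4-bit quantization, then attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it",
    max_seq_length=1024,  # assumed: room for max_prompt_length + max_completion_length
    load_in_4bit=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=8,
    use_gradient_checkpointing="unsloth",  # Unsloth's gradient-checkpointing implementation
)

# Placeholder dataset; the real records held pre-templated Gemma 3 prompts.
train_dataset = Dataset.from_list([{"prompt": "..."}])

def reward_batch(completions, **kwargs):
    # Batched wrapper around the per-completion reward sketched earlier.
    return [combined_reward(c) for c in completions]

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,  # effective batch size: 4
    num_generations=2,
    max_prompt_length=512,
    max_completion_length=512,
    max_steps=500,
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    output_dir="outputs",  # assumed
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[reward_batch],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```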
## Evaluation Results

* **Quantitative:** No formal evaluation metrics were calculated for this iteration.
* **Training Metrics:** Positive average rewards (~0.4-0.6+) were observed, indicating that the reward functions provided a learning signal. Training completed the intended steps. W&B Run: [Link to your W&B run if public]
* **Qualitative:** Inference shows the model **adheres well to the requested `<reasoning>`/`<answer>` format**. Reasoning content often shows logical steps. Answer content is sometimes relevant but can suffer from **hallucinations or incoherence**, especially near the end of generation. Further training on more diverse data, and potentially refined reward functions, is required.

## Intended Use

* **Primary Use:** Research and experimentation by WordLift Lab on applying GRPO to teach structured SEO reasoning guided by SEOntology concepts, and demonstrating the fine-tuning process.
* **Out-of-Scope:** **NOT suitable for production use.** Reliability and factual accuracy are not guaranteed.

## Limitations and Bias

* **Small, Synthetic Dataset:** Trained on only 100 synthetic examples; generalization is limited.
* **Content Quality:** Prone to hallucinations and incoherence; factual accuracy is not verified.
* **Reward Function Simplicity:** Rewards focused on format and basic keywords, not deep semantic correctness.
* **Training Instability/Speed:** Significant performance challenges were encountered during development, particularly with the 4B model. The successful 1B run required specific configurations that impacted speed.
* **Inherited Bias:** Inherits biases from the base `gemma-3-1b-it` model and from the Gemini 1.5 Pro model used for data generation and evaluation.

## How to Use

Load the merged model (this repository contains the full 16-bit merged weights) using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

model_id = "cyberandy/gemma3-1b-feliSEO-2run"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in bfloat16 or float16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # Or torch.float16
    device_map="auto",            # Use GPU if available
    attn_implementation="eager",  # Recommended for Gemma 3 stability
)
model.eval()

# --- Define the system prompt used during training / expected by the model ---
system_prompt = """Respond in the following format:
<reasoning>
Explain your thinking step-by-step. Use relevant SEO concepts.
</reasoning>
<answer>
Provide the final answer to the question or the requested SEO element.
</answer>"""

# --- Prepare input ---
user_query = "What schema.org type should be used for a local dental clinic's homepage?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

# --- Configure generation ---
gen_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 1,  # Use EOS ID for pad
    eos_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 1,
)

# --- Generate ---
with torch.no_grad():
    outputs = model.generate(input_ids=inputs, generation_config=gen_config)

# --- Decode only the newly generated tokens ---
generated_ids = outputs[0][inputs.shape[-1]:]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(f"Prompt: {user_query}")
print("-" * 20)
print(f"Generated Output:\n{generated_text}")
```
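Since the model is trained to emit `<reasoning>` and `<answer>` blocks, you will typically want to split the output into its two sections. A minimal sketch follows (the `extract_sections` helper is illustrative, not part of this repository); it returns `None` for any section whose tags are missing from the completion:

```python
import re

def extract_sections(text: str) -> dict:
    """Split a completion into its <reasoning> and <answer> parts, if present."""
    sections = {}
    for tag in ("reasoning", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections

parsed = extract_sections(generated_text)
print("Reasoning:", parsed["reasoning"])
print("Answer:", parsed["answer"])
```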