cyberandy committed on
Commit 2b947ec · verified · 1 Parent(s): e5f0ffd

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -69,16 +69,16 @@ The primary goal of this PoC was to **test the hypothesis** that combining Reinf
 
 ## Methodology: Ontology-Guided Reinforcement Learning
 
-Unlike standard Supervised Fine-Tuning (SFT) which primarily teaches mimicry, we employed Reinforcement Learning (RL) to explicitly teach the model *how* to reason.
+This novel methodology, which leverages structured knowledge from a domain-specific ontology to guide Reinforcement Learning, was first presented at the Knowledge Graph Conference (KGC). Unlike standard Supervised Fine-Tuning (SFT) which primarily teaches mimicry, we employed Reinforcement Learning (RL) to explicitly teach the model *how* to reason effectively within the SEO domain.
 
 * **Base Model:** `unsloth/gemma-3-4b-it-bnb-4bit` (providing foundational language capabilities).
-* **Structured Knowledge:** The **SEOntology (seovoc)**, an ontology defining key SEO entities, properties, and relationships, served as the structured knowledge base.
+* **Structured Knowledge:** The **SEOntology (seovoc)**, an ontology defining key SEO entities, properties, and relationships ([https://w3id.org/seovoc/](https://w3id.org/seovoc/)), served as the structured knowledge base.
 * **Learning Method:** Group Relative Policy Optimization (GRPO) via the `trl` library, accelerated with Unsloth. GRPO was chosen to optimize the policy (the model's generation strategy) directly based on reward signals.
 * **Ontology-Guided Reward Signal:** This is the core of the methodology. A custom reward function was designed, utilizing an LLM-as-a-Judge (Gemini 1.5 Pro). This judge evaluated the model's generated `<reasoning>` and `<answer>` based on several criteria, **crucially including alignment with SEO best practices and the explicit use/implication of relevant concepts from the `seovoc` ontology**. Models were rewarded for outputs demonstrating logical steps consistent with the knowledge structured in the ontology.
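To make the reward signal concrete, the following is a minimal, hypothetical sketch of what such an ontology-guided LLM-as-a-Judge reward could look like using the `google-generativeai` client. The prompt wording, the seovoc concept list, and the score parsing are illustrative assumptions, not the exact implementation behind this card.

```python
# Hypothetical sketch of an ontology-guided LLM-as-a-Judge reward (Gemini 1.5 Pro).
# Prompt wording, seovoc concept list, and score parsing are illustrative only.
import os
import re
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-1.5-pro")

JUDGE_TEMPLATE = """You are an SEO expert. Score the response from 0.0 to 1.0 for:
accuracy, relevance, correct <reasoning>/<answer> formatting, and alignment with
these SEOntology (seovoc) concepts: {seovoc_context}

Task: {prompt}
Response: {completion}

Reply with a single number between 0.0 and 1.0."""

def ontology_guided_reward(prompts, completions, **kwargs):
    """Return one score per completion, in the shape trl's GRPOTrainer expects."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        reply = judge.generate_content(
            JUDGE_TEMPLATE.format(
                seovoc_context="e.g. seovoc:Query, seovoc:WebPage",  # placeholder context
                prompt=prompt,
                completion=completion,
            )
        ).text
        match = re.search(r"\d+(?:\.\d+)?", reply)
        score = float(match.group()) if match else 0.0
        rewards.append(min(max(score, 0.0), 1.0))
    return rewards
```

A callable with this shape can be passed as `reward_funcs=[ontology_guided_reward]`; the same rubric can also be reused offline to assign the 0.0–1.0 rewards stored in the dataset described below.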
 
 ## Fine-tuning Details
 
-* **Dataset:** A custom synthetic dataset (`cyberandy/seo-grpo-reasoning-dataset-1000` containing ~960 cleaned examples) covering SEO tasks like Meta Description Optimization, Internal Link Suggestion, Query Trend Analysis, Schema.org Suggestion, NER, Title Optimization, Intent Classification, Robots.txt Rules, Canonicalization, E-E-A-T Assessment.
+* **Dataset:** A custom synthetic dataset (`cyberandy/seo-grpo-reasoning-dataset-1000` containing ~960 cleaned examples). This dataset was programmatically generated using Gemini 1.5 Pro, based on detailed task templates that explicitly referenced and incorporated concepts from the SEOntology (`seovoc`). The generation process created pairs of input data, step-by-step reasoning (`<reasoning>...</reasoning>`), and a concise answer (`<answer>...</answer>`) for various SEO tasks (Meta Description Optimization, Internal Link Suggestion, Query Trend Analysis, Schema.org Suggestion, NER, Title Optimization, Intent Classification, Robots.txt Rules, Canonicalization, E-E-A-T Assessment, GMB Optimization, Product Schema Enhancement, Content Revision based on QA). These generated examples were then evaluated by an LLM-as-a-Judge (also Gemini 1.5 Pro), which assigned a reward score (between 0.0 and 1.0) based on the accuracy, relevance, format correctness, and **alignment of the reasoning and answer with the seovoc ontology concepts** presented as context to the judge. This scored data was then formatted into `{'prompt': '...', 'reward': float}` pairs for the GRPO training. You can read more about the dataset generation and evaluation methodology in our blog post (linking to the KGC material): [An Ontology-Driven Approach to Train Your Next SEO Agent](https://wordlift.io/blog/en/entity/knowledge-graph-conference/).
 * **Training Steps:** `500` steps.
 * **Key Hyperparameters:**
   * Learning Rate: `5e-6` (cosine decay)
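Putting the pieces together, the sketch below shows one plausible way to wire the card's stated settings (base model, `5e-6` cosine learning rate, `500` steps) into a GRPO run with Unsloth and `trl`. LoRA rank, batch sizing, and generation counts are assumed placeholders, not the authors' actual configuration.

```python
# Hypothetical sketch: GRPO fine-tuning with Unsloth + trl.
# Only the base model, learning-rate schedule, and step count come from the card;
# every other value is an assumed placeholder.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# 4-bit base model with LoRA adapters (rank/alpha are guesses).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# The card describes records shaped like {'prompt': '...', 'reward': float};
# GRPOTrainer only needs the 'prompt' column when rewards come from reward_funcs.
dataset = load_dataset("cyberandy/seo-grpo-reasoning-dataset-1000", split="train")

def ontology_guided_reward(completions, **kwargs):
    # Stand-in so this sketch is self-contained; swap in the judge-based
    # reward sketched earlier on this page.
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,             # from the card
    lr_scheduler_type="cosine",     # from the card
    max_steps=500,                  # from the card
    per_device_train_batch_size=4,  # placeholder
    num_generations=4,              # completions sampled per prompt (placeholder)
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[ontology_guided_reward],
    processing_class=tokenizer,
)
trainer.train()
```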
@@ -281,7 +281,7 @@ However, performance gaps compared to state-of-the-art models (like GPT-4o) were
 
 LLM-as-a-Judge (Gemini 1.5 Pro) scores reflected this, indicating stronger performance on simpler, more structured tasks and lower scores on complex reasoning and strict format adherence under stress.
 
-**Further details on the methodology and evaluation will be presented at the Knowledge Graph Conference (KGC) 2025.**
+**Further details on the methodology and evaluation have been presented at the Knowledge Graph Conference (KGC) 2025.**
 
 ## Intended Use & Purpose
 
@@ -314,4 +314,4 @@ Use this model responsibly. The authors are not liable for any decisions made ba
 * Developed by the WordLift team, pushing the boundaries of [Agentic SEO](https://wordlift.io/agent/) and [Marketing Automation](https://wordlift.io/agent/).
 * Built upon Google's Gemma 3 model and the Unsloth library for efficient fine-tuning.
 * Leverages concepts from schema.org and the SEOntology (seovoc).
-* Methodology to be presented at the Knowledge Graph Conference (KGC).
+* Methodology presented at the [Knowledge Graph Conference](https://wordlift.io/blog/en/entity/knowledge-graph-conference/) 2025 (KGC).
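As a closing illustration of the `<reasoning>...</reasoning>` / `<answer>...</answer>` output contract this card keeps referring to, a downstream consumer might split completions with a small helper like the one below; it is a generic sketch, not code from this repository.

```python
# Generic helper for the <reasoning>/<answer> output contract described in the card.
import re
from typing import Optional, Tuple

def split_reasoning_answer(completion: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) when both tags are present, else (None, None)."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not (reasoning and answer):
        return None, None
    return reasoning.group(1).strip(), answer.group(1).strip()

print(split_reasoning_answer(
    "<reasoning>The title exceeds 60 characters; trim it.</reasoning>"
    "<answer>Shorten the page title to under 60 characters.</answer>"
))
```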