Update README.md
README.md
CHANGED
@@ -69,16 +69,16 @@ The primary goal of this PoC was to **test the hypothesis** that combining Reinf
## Methodology: Ontology-Guided Reinforcement Learning

-
Unlike standard Supervised Fine-Tuning (SFT) which primarily teaches mimicry, we employed Reinforcement Learning (RL) to explicitly teach the model *how* to reason.
+
This novel methodology, which leverages structured knowledge from a domain-specific ontology to guide Reinforcement Learning, was first presented at the Knowledge Graph Conference (KGC). Unlike standard Supervised Fine-Tuning (SFT), which primarily teaches mimicry, we employed Reinforcement Learning (RL) to explicitly teach the model *how* to reason effectively within the SEO domain.

* **Base Model:** `unsloth/gemma-3-4b-it-bnb-4bit` (providing foundational language capabilities).
-
* **Structured Knowledge:** The **SEOntology (seovoc)**, an ontology defining key SEO entities, properties, and relationships, served as the structured knowledge base.
+
* **Structured Knowledge:** The **SEOntology (seovoc)**, an ontology defining key SEO entities, properties, and relationships ([https://w3id.org/seovoc/](https://w3id.org/seovoc/)), served as the structured knowledge base.
* **Learning Method:** Group Relative Policy Optimization (GRPO) via the `trl` library, accelerated with Unsloth. GRPO was chosen to optimize the policy (the model's generation strategy) directly based on reward signals.
* **Ontology-Guided Reward Signal:** This is the core of the methodology. A custom reward function was designed, utilizing an LLM-as-a-Judge (Gemini 1.5 Pro). This judge evaluated the model's generated `<reasoning>` and `<answer>` based on several criteria, **crucially including alignment with SEO best practices and the explicit use/implication of relevant concepts from the `seovoc` ontology**. Models were rewarded for outputs demonstrating logical steps consistent with the knowledge structured in the ontology.
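
To make the reward wiring concrete, here is a minimal sketch of what such an ontology-guided reward function could look like in the `trl` reward-function style. This is not the project's actual code: the judge prompt, `SEOVOC_CONTEXT`, and the small format bonus are illustrative assumptions.

```python
# Sketch of an ontology-guided reward function (illustrative, not the project's code).
# Assumes the `google-generativeai` client for the Gemini judge and a serialized
# snippet of seovoc concepts in SEOVOC_CONTEXT (both hypothetical placeholders).
import re

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
judge = genai.GenerativeModel("gemini-1.5-pro")

SEOVOC_CONTEXT = "..."  # relevant seovoc classes/properties passed to the judge

JUDGE_PROMPT = (
    "You are an SEO expert. Using these SEOntology (seovoc) concepts as context:\n"
    "{ontology}\n\n"
    "Score the output below from 0.0 to 1.0 for accuracy, relevance, format "
    "correctness, and alignment with the ontology. Reply with a single number.\n\n"
    "{output}"
)

def format_reward(completion: str) -> float:
    """Small bonus when the <reasoning>/<answer> structure is respected."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.2 if re.search(pattern, completion, re.DOTALL) else 0.0

def ontology_judge_reward(completions, **kwargs):
    """trl-style reward function: returns one float per completion."""
    rewards = []
    for completion in completions:
        try:
            reply = judge.generate_content(
                JUDGE_PROMPT.format(ontology=SEOVOC_CONTEXT, output=completion)
            )
            score = float(reply.text.strip())
        except Exception:
            score = 0.0  # unparseable judge output earns no reward
        rewards.append(min(1.0, max(0.0, score) + format_reward(completion)))
    return rewards
```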
## Fine-tuning Details

-
* **Dataset:** A custom synthetic dataset (`cyberandy/seo-grpo-reasoning-dataset-1000` containing ~960 cleaned examples)
+
* **Dataset:** A custom synthetic dataset (`cyberandy/seo-grpo-reasoning-dataset-1000` containing ~960 cleaned examples). This dataset was programmatically generated using Gemini 1.5 Pro, based on detailed task templates that explicitly referenced and incorporated concepts from the SEOntology (`seovoc`). The generation process created pairs of input data, step-by-step reasoning (`<reasoning>...</reasoning>`), and a concise answer (`<answer>...</answer>`) for various SEO tasks (Meta Description Optimization, Internal Link Suggestion, Query Trend Analysis, Schema.org Suggestion, NER, Title Optimization, Intent Classification, Robots.txt Rules, Canonicalization, E-E-A-T Assessment, GMB Optimization, Product Schema Enhancement, Content Revision based on QA). These generated examples were then evaluated by an LLM-as-a-Judge (also Gemini 1.5 Pro), which assigned a reward score (between 0.0 and 1.0) based on the accuracy, relevance, format correctness, and **alignment of the reasoning and answer with the seovoc ontology concepts** presented as context to the judge. This scored data was then formatted into `{'prompt': '...', 'reward': float}` pairs for the GRPO training. You can read more about the dataset generation and evaluation methodology in our blog post (linking to the KGC material): [An Ontology-Driven Approach to Train Your Next SEO Agent](https://wordlift.io/blog/en/entity/knowledge-graph-conference/).
* **Training Steps:** `500` steps.
* **Key Hyperparameters:**
  * Learning Rate: `5e-6` (cosine decay)
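
Putting the pieces together, the sketch below shows one plausible way to wire the dataset, the reward function from the Methodology section, and the hyperparameters above into Unsloth + `trl`. Sequence length, LoRA settings, and batch size are not stated in this README and are illustrative assumptions; `ontology_judge_reward` is the hypothetical sketch shown earlier.

```python
# Sketch of the GRPO fine-tuning setup (illustrative assumptions noted inline).
from unsloth import FastLanguageModel  # import unsloth before trl, per Unsloth docs
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# 4-bit base model from the README; max_seq_length is an assumption.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA adapters for memory-efficient RL fine-tuning; rank/modules are assumptions.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Each record pairs a prompt with a judge-assigned reward, e.g.
# {"prompt": "Rewrite this meta description ...", "reward": 0.85}
dataset = load_dataset("cyberandy/seo-grpo-reasoning-dataset-1000", split="train")

args = GRPOConfig(
    learning_rate=5e-6,             # from the README
    lr_scheduler_type="cosine",     # cosine decay, from the README
    max_steps=500,                  # from the README
    per_device_train_batch_size=1,  # assumption
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[ontology_judge_reward],  # hypothetical sketch above
    args=args,
    train_dataset=dataset,
)
trainer.train()
```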
@@ -281,7 +281,7 @@ However, performance gaps compared to state-of-the-art models (like GPT-4o) were
LLM-as-a-Judge (Gemini 1.5 Pro) scores reflected this, indicating stronger performance on simpler, more structured tasks and lower scores on complex reasoning and strict format adherence under stress.

-
**Further details on the methodology and evaluation
+
**Further details on the methodology and evaluation have been presented at the Knowledge Graph Conference (KGC) 2025.**

## Intended Use & Purpose
@@ -314,4 +314,4 @@ Use this model responsibly. The authors are not liable for any decisions made ba
* Developed by the WordLift team, pushing the boundaries of [Agentic SEO](https://wordlift.io/agent/) and [Marketing Automation](https://wordlift.io/agent/).
* Built upon Google's Gemma 3 model and the Unsloth library for efficient fine-tuning.
* Leverages concepts from schema.org and the SEOntology (seovoc).
-
* Methodology
+
* Methodology presented at the [Knowledge Graph Conference](https://wordlift.io/blog/en/entity/knowledge-graph-conference/) 2025 (KGC).