cyberandy committed on
Commit 2b947ec · verified · 1 Parent(s): e5f0ffd

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -69,16 +69,16 @@ The primary goal of this PoC was to **test the hypothesis** that combining Reinf
 
 ## Methodology: Ontology-Guided Reinforcement Learning
 
-Unlike standard Supervised Fine-Tuning (SFT) which primarily teaches mimicry, we employed Reinforcement Learning (RL) to explicitly teach the model *how* to reason.
+This novel methodology, which leverages structured knowledge from a domain-specific ontology to guide Reinforcement Learning, was first presented at the Knowledge Graph Conference (KGC). Unlike standard Supervised Fine-Tuning (SFT) which primarily teaches mimicry, we employed Reinforcement Learning (RL) to explicitly teach the model *how* to reason effectively within the SEO domain.
 
 * **Base Model:** `unsloth/gemma-3-4b-it-bnb-4bit` (providing foundational language capabilities).
-* **Structured Knowledge:** The **SEOntology (seovoc)**, an ontology defining key SEO entities, properties, and relationships, served as the structured knowledge base.
+* **Structured Knowledge:** The **SEOntology (seovoc)**, an ontology defining key SEO entities, properties, and relationships ([https://w3id.org/seovoc/](https://w3id.org/seovoc/)), served as the structured knowledge base.
 * **Learning Method:** Group Relative Policy Optimization (GRPO) via the `trl` library, accelerated with Unsloth. GRPO was chosen to optimize the policy (the model's generation strategy) directly based on reward signals.
 * **Ontology-Guided Reward Signal:** This is the core of the methodology. A custom reward function was designed, utilizing an LLM-as-a-Judge (Gemini 1.5 Pro). This judge evaluated the model's generated `<reasoning>` and `<answer>` based on several criteria, **crucially including alignment with SEO best practices and the explicit use/implication of relevant concepts from the `seovoc` ontology**. Models were rewarded for outputs demonstrating logical steps consistent with the knowledge structured in the ontology.
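To make the reward signal concrete, the following is a minimal, hypothetical sketch of what such an ontology-guided LLM-as-a-Judge reward could look like using the `google-generativeai` client. The prompt wording, the seovoc concept list, and the score parsing are illustrative assumptions, not the exact implementation behind this card.

```python
# Hypothetical sketch of an ontology-guided LLM-as-a-Judge reward (Gemini 1.5 Pro).
# Prompt wording, seovoc concept list, and score parsing are illustrative only.
import os
import re
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-1.5-pro")

JUDGE_TEMPLATE = """You are an SEO expert. Score the response from 0.0 to 1.0 for:
accuracy, relevance, correct <reasoning>/<answer> formatting, and alignment with
these SEOntology (seovoc) concepts: {seovoc_context}

Task: {prompt}
Response: {completion}

Reply with a single number between 0.0 and 1.0."""

def ontology_guided_reward(prompts, completions, **kwargs):
    """Return one score per completion, in the shape trl's GRPOTrainer expects."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        reply = judge.generate_content(
            JUDGE_TEMPLATE.format(
                seovoc_context="e.g. seovoc:Query, seovoc:WebPage",  # placeholder context
                prompt=prompt,
                completion=completion,
            )
        ).text
        match = re.search(r"\d+(?:\.\d+)?", reply)
        score = float(match.group()) if match else 0.0
        rewards.append(min(max(score, 0.0), 1.0))
    return rewards
```

A callable with this shape can be passed as `reward_funcs=[ontology_guided_reward]`; the same rubric can also be reused offline to assign the 0.0–1.0 rewards stored in the dataset described below.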
 
 ## Fine-tuning Details
 
-* **Dataset:** A custom synthetic dataset (`cyberandy/seo-grpo-reasoning-dataset-1000` containing ~960 cleaned examples) covering SEO tasks like Meta Description Optimization, Internal Link Suggestion, Query Trend Analysis, Schema.org Suggestion, NER, Title Optimization, Intent Classification, Robots.txt Rules, Canonicalization, E-E-A-T Assessment.
+* **Dataset:** A custom synthetic dataset (`cyberandy/seo-grpo-reasoning-dataset-1000` containing ~960 cleaned examples). This dataset was programmatically generated using Gemini 1.5 Pro, based on detailed task templates that explicitly referenced and incorporated concepts from the SEOntology (`seovoc`). The generation process created pairs of input data, step-by-step reasoning (`<reasoning>...</reasoning>`), and a concise answer (`<answer>...</answer>`) for various SEO tasks (Meta Description Optimization, Internal Link Suggestion, Query Trend Analysis, Schema.org Suggestion, NER, Title Optimization, Intent Classification, Robots.txt Rules, Canonicalization, E-E-A-T Assessment, GMB Optimization, Product Schema Enhancement, Content Revision based on QA). These generated examples were then evaluated by an LLM-as-a-Judge (also Gemini 1.5 Pro), which assigned a reward score (between 0.0 and 1.0) based on the accuracy, relevance, format correctness, and **alignment of the reasoning and answer with the seovoc ontology concepts** presented as context to the judge. This scored data was then formatted into `{'prompt': '...', 'reward': float}` pairs for the GRPO training. You can read more about the dataset generation and evaluation methodology in our blog post (linking to the KGC material): [An Ontology-Driven Approach to Train Your Next SEO Agent](https://wordlift.io/blog/en/entity/knowledge-graph-conference/).
 * **Training Steps:** `500` steps.
 * **Key Hyperparameters:**
   * Learning Rate: `5e-6` (cosine decay)
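Putting the pieces together, the sketch below shows one plausible way to wire the card's stated settings (base model, `5e-6` cosine learning rate, `500` steps) into a GRPO run with Unsloth and `trl`. LoRA rank, batch sizing, and generation counts are assumed placeholders, not the authors' actual configuration.

```python
# Hypothetical sketch: GRPO fine-tuning with Unsloth + trl.
# Only the base model, learning-rate schedule, and step count come from the card;
# every other value is an assumed placeholder.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# 4-bit base model with LoRA adapters (rank/alpha are guesses).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# The card describes records shaped like {'prompt': '...', 'reward': float};
# GRPOTrainer only needs the 'prompt' column when rewards come from reward_funcs.
dataset = load_dataset("cyberandy/seo-grpo-reasoning-dataset-1000", split="train")

def ontology_guided_reward(completions, **kwargs):
    # Stand-in so this sketch is self-contained; swap in the judge-based
    # reward sketched earlier on this page.
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,             # from the card
    lr_scheduler_type="cosine",     # from the card
    max_steps=500,                  # from the card
    per_device_train_batch_size=4,  # placeholder
    num_generations=4,              # completions sampled per prompt (placeholder)
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[ontology_guided_reward],
    processing_class=tokenizer,
)
trainer.train()
```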
@@ -281,7 +281,7 @@ However, performance gaps compared to state-of-the-art models (like GPT-4o) were
 
 LLM-as-a-Judge (Gemini 1.5 Pro) scores reflected this, indicating stronger performance on simpler, more structured tasks and lower scores on complex reasoning and strict format adherence under stress.
 
-**Further details on the methodology and evaluation will be presented at the Knowledge Graph Conference (KGC) 2025.**
+**Further details on the methodology and evaluation have been presented at the Knowledge Graph Conference (KGC) 2025.**
 
 ## Intended Use & Purpose
 
@@ -314,4 +314,4 @@ Use this model responsibly. The authors are not liable for any decisions made ba
 * Developed by the WordLift team, pushing the boundaries of [Agentic SEO](https://wordlift.io/agent/) and [Marketing Automation](https://wordlift.io/agent/).
 * Built upon Google's Gemma 3 model and the Unsloth library for efficient fine-tuning.
 * Leverages concepts from schema.org and the SEOntology (seovoc).
-* Methodology to be presented at the Knowledge Graph Conference (KGC).
+* Methodology presented at the [Knowledge Graph Conference](https://wordlift.io/blog/en/entity/knowledge-graph-conference/) 2025 (KGC).
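As a closing illustration of the `<reasoning>...</reasoning>` / `<answer>...</answer>` output contract this card keeps referring to, a downstream consumer might split completions with a small helper like the one below; it is a generic sketch, not code from this repository.

```python
# Generic helper for the <reasoning>/<answer> output contract described in the card.
import re
from typing import Optional, Tuple

def split_reasoning_answer(completion: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) when both tags are present, else (None, None)."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not (reasoning and answer):
        return None, None
    return reasoning.group(1).strip(), answer.group(1).strip()

print(split_reasoning_answer(
    "<reasoning>The title exceeds 60 characters; trim it.</reasoning>"
    "<answer>Shorten the page title to under 60 characters.</answer>"
))
```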