yhavinga committed on
Commit 003e9be · verified · 1 Parent(s): 7335bd5

Update README.md

Files changed (1)
  1. README.md +262 -29
README.md CHANGED
@@ -11,53 +11,286 @@ tags:
  - unsloth
  licence: license
  pipeline_tag: text-generation
  ---
 
- # Model Card for n5_label_addition_model
 
- This model is a fine-tuned version of [unsloth/Phi-4-mini-instruct-bnb-4bit](https://huggingface.co/unsloth/Phi-4-mini-instruct-bnb-4bit).
- It has been trained using [TRL](https://github.com/huggingface/trl).
 
- ## Quick start
 
  ```python
- from transformers import pipeline
 
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
  ```
 
- ## Training procedure
 
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yepster/n5-label-addition-unsloth/runs/xz3qtjtl)
 
- This model was trained with SFT.
 
- ### Framework versions
 
- - PEFT 0.16.0
- - TRL: 0.19.1
- - Transformers: 4.53.1
- - Pytorch: 2.7.1
- - Datasets: 4.0.0
- - Tokenizers: 0.21.2
 
- ## Citations
 
- Cite TRL as:
-
  ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
  }
  ```
 
  - unsloth
  licence: license
  pipeline_tag: text-generation
+ license: apache-2.0
+ datasets:
+ - UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps
+ language:
+ - nl
  ---
 
+ # Phi-4-mini N5 Label Addition Fine-tune
+
+ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct), optimized for adding human-readable labels (rdfs:label) to JSON-LD structures. It was trained as part of the WIM (Text-to-Knowledge Graph) pipeline on the signaalberichten dataset.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** UWV InnovatieHub
+ - **Model type:** Causal Language Model with LoRA fine-tuning
+ - **Language(s):** Dutch (nl)
+ - **License:** MIT
+ - **Finetuned from:** microsoft/Phi-4-mini-instruct (3.82B parameters)
+ - **Training Framework:** Unsloth (optimized for efficient training)
+
+ ### Training Details
+
+ - **Dataset:** [UWV/wim_instruct_signaalberichten_to_jsonld_agent_steps](https://huggingface.co/datasets/UWV/wim_instruct_signaalberichten_to_jsonld_agent_steps)
+ - **Dataset Size:** 4,525 N5-specific examples (label addition tasks)
+ - **Training Duration:** 1 hour 44 minutes
+ - **Hardware:** NVIDIA A100 80GB
+ - **Epochs:** 3.1
+ - **Steps:** 1,735
+ - **Training Metrics:**
+   - Final Training Loss: 0.7864
+   - Training samples/second: 2.209
+   - Learning rate (final): 6.26e-10
+
+ ### LoRA Configuration
+
+ ```python
+ {
+     "r": 512,                # Large rank for quality
+     "lora_alpha": 1024,      # Alpha (2:1 ratio)
+     "lora_dropout": 0.1,     # Higher dropout for small dataset
+     "bias": "none",
+     "task_type": "CAUSAL_LM",
+     "target_modules": [
+         "q_proj", "k_proj", "v_proj", "o_proj"  # Attention layers only
+     ]
+ }
+ ```
+
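+ For reference, the adapter settings above roughly correspond to the following PEFT configuration (a minimal sketch for illustration, not the exact training script):
+
+ ```python
+ from peft import LoraConfig
+
+ # Reconstruction of the adapter configuration listed above (illustrative only)
+ lora_config = LoraConfig(
+     r=512,
+     lora_alpha=1024,
+     lora_dropout=0.1,
+     bias="none",
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+ )
+ ```
+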
+ ### Training Configuration
+
+ ```python
+ {
+     "model": "phi4-mini",
+     "max_seq_length": 4096,
+     "batch_size": 8,
+     "gradient_accumulation_steps": 1,
+     "effective_batch_size": 8,
+     "learning_rate": 2e-5,
+     "warmup_steps": 50,
+     "max_grad_norm": 1.0,
+     "lr_scheduler": "cosine",
+     "optimizer": "paged_adamw_8bit",
+     "bf16": True,
+     "seed": 42
+ }
+ ```
+
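+ These values map roughly onto standard Hugging Face training arguments; the sketch below is an approximation for orientation (the argument names are the generic `TrainingArguments` fields, `output_dir` is a placeholder, and the original Unsloth training script may differ):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Approximate equivalent of the training configuration above
+ training_args = TrainingArguments(
+     output_dir="outputs",                  # placeholder
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=1,         # effective batch size 8
+     learning_rate=2e-5,
+     warmup_steps=50,
+     max_grad_norm=1.0,
+     lr_scheduler_type="cosine",
+     optim="paged_adamw_8bit",
+     bf16=True,
+     seed=42,
+     num_train_epochs=3,                    # ~3.1 epochs reported above
+ )
+ ```
+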
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+
+ - **Label Addition**: Add human-readable Dutch labels (rdfs:label) to JSON-LD structures
+ - **Knowledge Graph Enhancement**: Fifth step (N5) in the WIM pipeline
+ - **Government Services**: Optimized for citizen complaints and government service descriptions
+ - **JSON-LD Enrichment**: Make knowledge graphs more accessible with descriptive labels
+
+ ### Limitations
+
+ - Trained on signaalberichten dataset (different domain than N1-N3)
+ - Best performance on government/municipal service contexts
+ - Requires well-formed JSON-LD as input
+ - Limited to 4K token context (sufficient for label addition)
+ - Small training dataset (4,525 examples)
+
+ ## How to Use
+
+ ### Option 1: Using the Merged Model (Recommended)
+
104
+ ```python
105
+ from transformers import AutoModelForCausalLM, AutoTokenizer
106
+ import torch
107
+ import json
108
+
109
+ # Load the merged model (ready to use)
110
+ model = AutoModelForCausalLM.from_pretrained(
111
+ "UWV/wim-n5-phi4-mini-merged",
112
+ torch_dtype=torch.bfloat16,
113
+ device_map="auto",
114
+ trust_remote_code=True
115
+ )
116
+ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
117
+
118
+ # Prepare input - JSON-LD without labels (citizen complaint)
119
+ json_ld = {
120
+ "@context": "https://schema.org",
121
+ "@type": "Report",
122
+ "about": {
123
+ "@type": "CivicStructure",
124
+ "name": "Speeltuin Vondelpark"
125
+ },
126
+ "reportedBy": {
127
+ "@type": "Person",
128
+ "address": {
129
+ "@type": "PostalAddress",
130
+ "addressLocality": "Amsterdam"
131
+ }
132
+ }
133
+ }
134
+
135
+ messages = [
136
+ {
137
+ "role": "system",
138
+ "content": "Je bent een expert in het toevoegen van Nederlandse labels aan JSON-LD."
139
+ },
140
+ {
141
+ "role": "user",
142
+ "content": f"""Voeg rdfs:label toe aan de volgende JSON-LD:
143
+
144
+ {json.dumps(json_ld, ensure_ascii=False, indent=2)}
145
+
146
+ Geef de complete JSON-LD terug met labels."""
147
+ }
148
+ ]
149
+
150
+ # Apply chat template and generate
151
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
152
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
153
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
154
+
155
+ with torch.no_grad():
156
+ outputs = model.generate(
157
+ **inputs,
158
+ max_new_tokens=1000,
159
+ temperature=0.1, # Low temperature for consistent labeling
160
+ do_sample=True,
161
+ top_p=0.95,
162
+ pad_token_id=tokenizer.pad_token_id,
163
+ eos_token_id=tokenizer.eos_token_id,
164
+ )
165
+
166
+ # Decode response
167
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
168
+ if "assistant:" in response:
169
+ response = response.split("assistant:")[-1].strip()
170
+
171
+ print(response)
172
+ ```
173
+
+ ### Option 2: Using the LoRA Adapter
 
  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ import torch
 
+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "microsoft/Phi-4-mini-instruct",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+
+ # Load adapter
+ model = PeftModel.from_pretrained(
+     base_model,
+     "UWV/wim-n5-phi4-mini-adapter"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")
+
+ # Use same inference code as above...
  ```
 
+ ## Expected Output Format
+
+ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
+
+ ```json
+ {
+   "@context": "https://schema.org",
+   "@type": "Report",
+   "rdfs:label": "Melding",
+   "about": {
+     "@type": "CivicStructure",
+     "rdfs:label": "Speeltuin Vondelpark",
+     "name": "Speeltuin Vondelpark"
+   },
+   "reportedBy": {
+     "@type": "Person",
+     "rdfs:label": "Melder",
+     "address": {
+       "@type": "PostalAddress",
+       "rdfs:label": "Adres in Amsterdam",
+       "addressLocality": "Amsterdam"
+     }
+   }
+ }
+ ```
+
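+ The generated text may contain extra tokens around the JSON-LD, so a small amount of post-processing helps. The helper below is a sketch (it assumes the response contains exactly one top-level JSON object; `response` comes from the inference example above):
+
+ ```python
+ import json
+
+ def extract_json_ld(response: str) -> dict:
+     """Extract and parse the first top-level JSON object in the model output."""
+     start = response.find("{")
+     end = response.rfind("}")
+     if start == -1 or end <= start:
+         raise ValueError("No JSON object found in model output")
+     return json.loads(response[start:end + 1])
+
+ labeled = extract_json_ld(response)
+ print(labeled.get("rdfs:label"))
+ ```
+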
+ ## Dataset Information
+
+ The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
+
+ - **Source**: Signaalberichten (citizen complaints to municipalities)
+ - **Domain**: Government services and municipal operations
+ - **N5 Examples**: 4,525 label addition tasks
+ - **Average Token Length**: 1,636 tokens
+ - **Max Token Length**: 2,332 tokens
+ - **Format**: ChatML-formatted instruction-following examples
+ - **Task**: Add Dutch rdfs:label properties to JSON-LD
+
+ **Important**: This dataset is different from the Wikipedia-based dataset used for N1-N3 models.
+
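+ To inspect the data, the dataset can be loaded with the `datasets` library. This is a minimal sketch: the split name and column layout are assumptions, so check the dataset card for the actual schema:
+
+ ```python
+ from datasets import load_dataset
+
+ # Split name "train" is an assumption; see the dataset card for the real splits
+ ds = load_dataset("UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps", split="train")
+ print(ds)      # prints the actual column names and row count
+ print(ds[0])   # inspect one ChatML-formatted example
+ ```
+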
+ ## Training Results
+
+ The model completed 3.1 epochs through the dataset:
+
+ - **Final Training Loss**: 0.7864
+ - **Training Efficiency**: 2.209 samples/second
+
+ ### Loss Progression
+
+ - Started at ~1.13 loss
+ - Rapid improvement in the first epoch
+ - Stable convergence throughout training
+ - Final learning rate: 6.26e-10 (cosine decay)
+ - Gradient norms: stable around 0.6-0.7
+
+ ## Model Versions
+
+ - **Merged Model**: `UWV/wim-n5-phi4-mini-merged`
+   - Note: the merge failed due to a known Phi-4 issue
+   - Adapter weights were saved instead
+   - The model works fine for inference
+
+ - **LoRA Adapter**: `UWV/wim-n5-phi4-mini-adapter` (~2.29 GB)
+   - Requires the base Phi-4-mini-instruct model
+   - Large adapter due to r=512
+   - Includes all training configurations
+
+ ## Pipeline Context
+
+ This model is part of the WIM (Text-to-Knowledge Graph) pipeline:
+
+ 1. **N1**: Entity Extraction
+ 2. **N2**: Schema.org Type Selection
+ 3. **N3**: Transform to JSON-LD
+ 4. **N4**: Validation
+ 5. **N5 (This Model)**: Add Human-Readable Labels
+
+ N5 is trained on a different dataset (signaalberichten) than N1-N3, focusing on government services and citizen interactions rather than encyclopedic content.
+
+ ## Performance Characteristics
+
+ - **Sequence Length**: Average 1,636 tokens (moderate length)
+ - **Batch Processing**: Can handle batch size 8 with 4K context
+ - **Inference Speed**: Fast label addition to existing JSON-LD
+ - **Memory Usage**: ~10GB VRAM with 4K context
+ - **Domain**: Specialized for Dutch government/municipal contexts
+
+ ## Citation
+
+ If you use this model, please cite:
+
  ```bibtex
+ @misc{wim-n5-phi4-mini,
+     author = {UWV InnovatieHub},
+     title = {Phi-4-mini N5 Label Addition Model},
+     year = {2025},
+     publisher = {HuggingFace},
+     url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
  }
  ```