Update README.md
README.md
CHANGED
@@ -18,9 +18,10 @@ language:
 - nl
 ---
 
-# Phi-4-mini N5 Label Addition Fine-tune
 
-
 
 ## Model Details
 
@@ -43,6 +44,7 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
 - **Steps:** 1,735
 - **Training Metrics:**
   - Final Training Loss: 0.7864
   - Training samples/second: 2.209
   - Learning rate (final): 6.26e-10
 
@@ -84,18 +86,18 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
 
 ### Intended Uses
 
-- **
-- **
-- **
-- **
 
 ### Limitations
 
-- Trained on signaalberichten dataset (
--
--
-- Limited to 4K token context (sufficient for
--
 
 ## How to Use
 
@@ -115,35 +117,32 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
 
-# Prepare input -
-
-
-
-
-
-
-  },
-  "reportedBy": {
-    "@type": "Person",
-    "address": {
-      "@type": "PostalAddress",
-      "addressLocality": "Amsterdam"
-    }
-  }
-}
 
 messages = [
     {
         "role": "system",
-        "content": "
     },
     {
         "role": "user",
-        "content": f"""
 
-
 
-
     }
 ]
 
@@ -198,27 +197,20 @@ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")
 
 ## Expected Output Format
 
-The model
 
 ```json
 {
-  "
-  "
-
-
-
-
-  "
-
-
-
-    "rdfs:label": "Melder",
-    "address": {
-      "@type": "PostalAddress",
-      "rdfs:label": "Adres in Amsterdam",
-      "addressLocality": "Amsterdam"
-    }
-  }
 }
 ```
 
@@ -227,14 +219,14 @@ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
 The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
 
 - **Source**: Signaalberichten (citizen complaints to municipalities)
-- **Domain**:
-- **N5 Examples**: 4,525
 - **Average Token Length**: 1,636 tokens
 - **Max Token Length**: 2,332 tokens
 - **Format**: ChatML-formatted instruction-following examples
-- **Task**:
 
-**Important**: This
 
 ## Training Results
 
@@ -261,17 +253,20 @@ The model completed 3.1 epochs through the dataset:
 - Large adapter due to r=512
 - Includes all training configurations
 
-##
 
-
 
-
-
-
-
-
 
-
 
 ## Performance Characteristics
 
@@ -288,9 +283,9 @@ If you use this model, please cite:
 ```bibtex
 @misc{wim-n5-phi4-mini,
   author = {UWV InnovatieHub},
-  title = {Phi-4-mini N5
   year = {2025},
   publisher = {HuggingFace},
   url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
 }
 ```
@@ -18,9 +18,10 @@ language:
 - nl
 ---
 
 
+# Phi-4-mini N5 Complaint Categorization Fine-tune
+
+This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for categorizing citizen complaints into predefined topic and experience labels, trained on the signaalberichten dataset.
 
 ## Model Details
 
@@ -43,6 +44,7 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
 - **Steps:** 1,735
 - **Training Metrics:**
   - Final Training Loss: 0.7864
+  - Final Eval Loss: 0.7796
   - Training samples/second: 2.209
   - Learning rate (final): 6.26e-10
 
@@ -84,18 +86,18 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
 
 ### Intended Uses
 
+- **Complaint Categorization**: Classify citizen complaints into topic and experience categories
+- **Municipal Service Analysis**: Analyze phone transcripts and written complaints
+- **Topic Detection**: Identify what the complaint is about (e.g., waste, parking, permits)
+- **Experience Analysis**: Determine how citizens experience the service (e.g., communication, speed, clarity)
 
 ### Limitations
 
+- Trained on signaalberichten dataset (Dutch municipal complaints)
+- Fixed label vocabulary (cannot create new labels)
+- Best performance on complaint/service interaction texts
+- Limited to 4K token context (sufficient for most complaints)
+- Specific to Dutch government/municipal contexts
 
 ## How to Use
 
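The 4K-token limit listed under Limitations can be checked before a complaint is sent to the model. The following is a minimal sketch, not part of the commit: `fits_context` and `whitespace_token_count` are hypothetical helpers, the 4,096 figure is an assumed reading of "4K", and the output reserve of 512 tokens is an illustrative value.

```python
from typing import Callable

MAX_CONTEXT_TOKENS = 4096  # assumption: "4K" read as 4,096 tokens


def fits_context(
    text: str,
    count_tokens: Callable[[str], int],
    reserve_for_output: int = 512,  # assumed budget for the generated JSON reply
    max_tokens: int = MAX_CONTEXT_TOKENS,
) -> bool:
    """Return True if the prompt plus the reserved output budget fits the context."""
    return count_tokens(text) + reserve_for_output <= max_tokens


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())
```

With the real tokenizer, something like `count_tokens=lambda s: len(tokenizer(s)["input_ids"])` could be passed instead of the whitespace stand-in. Tokens added by the chat template are not counted in this sketch, so the reserve should be chosen with some margin.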
@@ -115,35 +117,32 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
 
+# Prepare input - complaint text for categorization
+complaint_text = """
+Burger: Nou, waar ik dus over wil klagen is het afval in de buurt.
+Het is echt niet normaal meer, met al die vuilniszakken die op straat worden gegooid.
+De containers zijn vaak vol en er komen ook ratten.
+Ik had al eens gebeld maar er wordt niks aan gedaan!
+"""
 
 messages = [
     {
         "role": "system",
+        "content": "Jij bent een expert in het toewijzen van labels aan een tekst."
     },
     {
         "role": "user",
+        "content": f"""Analyseer de onderstaande tekst en bepaal welke labels van toepassing zijn.
+
+**Onderwerp labels** (selecteer wat van toepassing is):
+Vuil/ongedierte overlast, Bruikbaarheid/beschikbaarheid afvalcontainers,
+Parkeeroverlast, Vergunningen, etc.
 
+**Beleving labels** (selecteer wat van toepassing is):
+Communicatie, Op de hoogte houden, Statusinformatie, Snelheid van afhandeling, etc.
 
+**Tekst om te analyseren**:
+{complaint_text}"""
     }
 ]
 
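The prompt in this hunk writes the label lists directly into the f-string. As a sketch of an alternative, the same message layout could be assembled from Python lists. `build_messages` is a hypothetical helper, not something the model card defines, and the label lists passed in are only the examples quoted in the prompt (which itself ends in "etc."), not the full vocabulary. The exact whitespace the model saw during training is not stated, so this layout is an approximation.

```python
SYSTEM_PROMPT = "Jij bent een expert in het toewijzen van labels aan een tekst."


def build_messages(complaint_text, onderwerp_labels, beleving_labels):
    """Build chat messages following the layout of the example prompt."""
    user_content = (
        "Analyseer de onderstaande tekst en bepaal welke labels van toepassing zijn.\n\n"
        "**Onderwerp labels** (selecteer wat van toepassing is):\n"
        f"{', '.join(onderwerp_labels)}\n\n"
        "**Beleving labels** (selecteer wat van toepassing is):\n"
        f"{', '.join(beleving_labels)}\n\n"
        "**Tekst om te analyseren**:\n"
        f"{complaint_text.strip()}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Illustrative call with labels taken from the example prompt only.
messages = build_messages(
    "De containers zijn vaak vol en er komen ook ratten.",
    ["Vuil/ongedierte overlast", "Bruikbaarheid/beschikbaarheid afvalcontainers"],
    ["Communicatie", "Op de hoogte houden"],
)
```

Keeping the labels in lists makes it possible to reuse the same lists when validating the model's reply against the fixed vocabulary.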
@@ -198,27 +197,20 @@ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")
 
 ## Expected Output Format
 
+The model outputs a JSON response with categorization results:
 
 ```json
 {
+  "reasoning": "Omdat de burger klaagt over afval dat op straat wordt gegooid, volle containers en rattenoverlast, zijn de onderwerpen 'Vuil/ongedierte overlast' en 'Bruikbaarheid/beschikbaarheid afvalcontainers' het meest van toepassing. De beleving is negatief: de burger ervaart frustratie over het uitblijven van actie en het gebrek aan terugkoppeling.",
+  "onderwerp_labels": [
+    "Vuil/ongedierte overlast",
+    "Bruikbaarheid/beschikbaarheid afvalcontainers"
+  ],
+  "beleving_labels": [
+    "Op de hoogte houden",
+    "Statusinformatie",
+    "Communicatie"
+  ]
 }
 ```
 
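Because the model has a fixed label vocabulary, a reply in the format above can be parsed and filtered against the known labels. The sketch below is an assumption-laden illustration: `parse_categorization` is a hypothetical helper, and the two label sets contain only the labels quoted in the example prompt, not the complete vocabulary.

```python
import json

# Assumption: illustrative subsets of the fixed vocabulary, taken from the example prompt.
ONDERWERP_LABELS = {
    "Vuil/ongedierte overlast",
    "Bruikbaarheid/beschikbaarheid afvalcontainers",
    "Parkeeroverlast",
    "Vergunningen",
}
BELEVING_LABELS = {
    "Communicatie",
    "Op de hoogte houden",
    "Statusinformatie",
    "Snelheid van afhandeling",
}


def parse_categorization(raw: str) -> dict:
    """Parse the model's JSON reply and keep only labels from the known vocabulary."""
    # Generated text may surround the JSON object with extra tokens, so slice
    # from the first "{" to the last "}" before decoding.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start : end + 1])
    return {
        "reasoning": data.get("reasoning", ""),
        "onderwerp_labels": [
            label for label in data.get("onderwerp_labels", []) if label in ONDERWERP_LABELS
        ],
        "beleving_labels": [
            label for label in data.get("beleving_labels", []) if label in BELEVING_LABELS
        ],
    }


example_reply = (
    '{"reasoning": "...", '
    '"onderwerp_labels": ["Vuil/ongedierte overlast", "Onbekend label"], '
    '"beleving_labels": ["Communicatie"]}'
)
result = parse_categorization(example_reply)
```

Dropping unknown labels silently is one possible design choice; logging or raising on them may be preferable when monitoring for model drift.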
@@ -227,14 +219,14 @@ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
 The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
 
 - **Source**: Signaalberichten (citizen complaints to municipalities)
+- **Domain**: Phone transcripts and written complaints about municipal services
+- **N5 Examples**: 4,525 complaint categorization tasks
 - **Average Token Length**: 1,636 tokens
 - **Max Token Length**: 2,332 tokens
 - **Format**: ChatML-formatted instruction-following examples
+- **Task**: Categorize complaints into predefined topic and experience labels
 
+**Important**: This is a different task and dataset from the WIM pipeline (N1-N4) which focuses on Wikipedia to JSON-LD conversion.
 
 ## Training Results
 
@@ -261,17 +253,20 @@ The model completed 3.1 epochs through the dataset:
 - Large adapter due to r=512
 - Includes all training configurations
 
+## Model Context
 
+**Note**: Despite the "n5" naming, this model is NOT part of the WIM (Wikipedia to Knowledge Graph) pipeline that includes N1-N4. This is a separate task focused on complaint categorization.
 
+### WIM Pipeline (Wikipedia to JSON-LD):
+1. **N1**: Entity Extraction from Wikipedia text
+2. **N2**: Schema.org Type Selection for entities
+3. **N3**: Transform to JSON-LD format
+4. **N4**: Validation of JSON-LD
 
+### This Model (N5 - Complaint Categorization):
+- **Task**: Categorize citizen complaints into topic and experience labels
+- **Dataset**: Signaalberichten (municipal complaints)
+- **Domain**: Government services and citizen interactions
 
 ## Performance Characteristics
 
@@ -288,9 +283,9 @@ If you use this model, please cite:
 ```bibtex
 @misc{wim-n5-phi4-mini,
   author = {UWV InnovatieHub},
+  title = {Phi-4-mini N5 Complaint Categorization Model},
   year = {2025},
   publisher = {HuggingFace},
   url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
 }
 ```