sanskxr02 committed on
Commit 1c10c9a · verified · 1 Parent(s): b2a5f91

Update README.md

Files changed (1)
  1. README.md +2 -132
README.md CHANGED
@@ -97,6 +97,8 @@ Users should be aware that ZentryPII-278M is optimized for ASR-style conversatio

  ---

+ # ZentryPII-278M
+
  ## How to Get Started with the Model

  Use the code snippet below to run the model using 🤗 Transformers:
@@ -113,144 +115,12 @@ output = ner("i met rohit near connaught place at three thirty")
  for ent in output:
      print(f"{ent['word']} → {ent['entity_group']}")

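The hunk above only shows the tail of the usage snippet. For readers of this diff, a minimal, self-contained sketch of the full call is given below; the repo id `sanskxr02/ZentryPII-278M` and the `aggregation_strategy` setting are assumptions (the card's output uses `entity_group`, which requires an aggregating strategy), not lines quoted from the README:

```python
from transformers import pipeline

# Assumed repo id; adjust to wherever the model actually lives on the Hub.
ner = pipeline(
    "token-classification",
    model="sanskxr02/ZentryPII-278M",
    aggregation_strategy="simple",  # groups subword pieces so 'entity_group' is populated
)

output = ner("i met rohit near connaught place at three thirty")
for ent in output:
    print(f"{ent['word']} → {ent['entity_group']}")
```
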
- ## Training Details
-
- ### Training Data
-
- The training data for ZentryPII-278M consists of a synthetic dataset of ~1,000 conversational, ASR-style utterances constructed using name, location, and time expression templates. These examples were generated to reflect realistic speech patterns, disfluencies (e.g., "um", "haan", "like"), and English-Hindi code-switching.
-
- Each sentence was tokenized and annotated with BIO-style labels:
- `B-NAME`: Names (e.g., Ramesh, Neha)
- `B-LOC`: Locations (e.g., Mumbai, Connaught Place)
- `B-TIME`: Time references (e.g., three thirty, Sunday morning)
- `O`: Non-PII tokens
-
- The dataset is not publicly released as a standalone resource, but was generated specifically to fine-tune ZentryPII on redaction-style PII tagging in noisy, multilingual text.
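To make the tagging scheme concrete, here is a purely hypothetical token/label pair in the format the removed section describes; the synthetic dataset itself is not released, so this is an illustration, not a real training example:

```python
# Hypothetical utterance annotated with the card's BIO label set.
# Single-token entities are used here to sidestep how multi-word spans are continued.
tokens = ["um", "i", "met", "neha",   "near", "mumbai", "at", "noon"]
labels = ["O",  "O", "O",   "B-NAME", "O",    "B-LOC",  "O",  "B-TIME"]
```
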
- ### Training Procedure
-
- **Model used:** `xlm-roberta-base`
- **Objective:** Token classification using cross-entropy loss
- **Training Framework:** Hugging Face Transformers (Trainer API)
- **Epochs:** 5
- **Train/Test Split:** 90/10, stratified at sentence level
- **Batch size:** 8
- **Optimizer:** AdamW
- **Hardware:** Google Colab T4 (free tier)
- **Eval Metric:** Token-level accuracy and evaluation (cross-entropy) loss
- **Final Eval Loss:** ~3.26e-05
-
- #### Training Hyperparameters
-
- **Training regime:** fp32
- **Learning rate scheduler:** linear with warmup
- **Weight decay:** 0.01
- **Warmup steps:** 0
- **Gradient clipping:** None
- **Evaluation strategy:** after each epoch
- **Save strategy:** every 500 steps
- **Logging:** every 10 steps
- **Seed:** 42
-
- ---
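For orientation, the hyperparameters listed in the removed section map onto a fairly standard `TrainingArguments` configuration. The sketch below is an assumption-laden reconstruction, not the project's actual training script; in particular, the learning rate is not stated in the card, so `2e-5` is only a placeholder:

```python
from transformers import TrainingArguments

# Mirrors the card's stated settings; anything not listed there is a guess.
args = TrainingArguments(
    output_dir="zentrypii-278m",     # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=2e-5,              # not stated in the card; placeholder value
    weight_decay=0.01,
    warmup_steps=0,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",     # "after each epoch" (Transformers 4.38-era argument name)
    save_steps=500,
    logging_steps=10,
    seed=42,
    fp16=False,                      # card reports fp32 training
)

# A Trainer would then be built with the token-classification model,
# the tokenized train/eval splits, and a token-classification data collator.
```
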
-
- #### Speeds, Sizes, Times
-
- **Total training time:** ~20 minutes on Google Colab (T4 GPU)
- **Final checkpoint size:** ~1.06 GB
- **Throughput:** ~39 samples/sec during evaluation
- **Evaluation runtime:** ~2.55 seconds
-
- ---
-
- ## Evaluation
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- The held-out test set (~10% of the synthetic BIO-tagged dataset) was split from the data used for training. It includes ASR-style sentences with varied disfluencies and Hindi-English code-switching.
-
- Evaluation was performed using Hugging Face’s `Trainer.evaluate()` API on token-level classification.
-
- **Entity types evaluated:**
- `B-NAME`
- `B-LOC`
- `B-TIME`
-
- **Metrics used:**
- `eval_loss` (cross-entropy)
- [planned: `seqeval` F1-score in a future update]
-
- **Final evaluation result:**
- `eval_loss`: `3.26e-05`
-
- #### Factors
-
- The evaluation did not explicitly disaggregate results by subpopulations or domains. However, the synthetic test set includes:
- Code-switched utterances (Hindi-English)
- Disfluent speech (fillers, hesitations)
- Mixed-case and punctuation-stripped phrases to simulate Whisper-style ASR output
-
- Future iterations may include evaluations across real-world datasets and dialectal variation.
-
- ---
-
- #### Metrics
-
- **Eval Loss** (cross-entropy): the average token-level cross-entropy of the correct labels; lower values indicate more confident, correct token classification.
- Intended metrics such as **precision**, **recall**, and **F1-score** (via `seqeval`) were not computed in this release but will be included in a future version for fine-grained NER-style performance analysis.
-
- ---
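Since the removed text flags `seqeval` precision, recall, and F1 as planned follow-up metrics, here is a minimal sketch of how such entity-level scores are typically computed; the tag sequences are toy placeholders, not the project's actual predictions:

```python
from seqeval.metrics import classification_report, f1_score

# Toy reference and prediction sequences using the card's label set.
y_true = [["O", "O", "B-NAME", "O", "B-LOC", "O", "B-TIME"]]
y_pred = [["O", "O", "B-NAME", "O", "O",     "O", "B-TIME"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```
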
-
- ### Results
-
- **Eval Loss (final checkpoint):** `3.26e-05`
- **Evaluation runtime:** `2.55 seconds`
- **Samples/sec during evaluation:** ~39
-
- The extremely low evaluation loss indicates near-perfect token-level accuracy on the held-out synthetic split; performance on real-world transcripts has not yet been measured.
-
- ---
-
- #### Summary
-
- ZentryPII-278M performs strongly on synthetic ASR-style transcripts, achieving near-zero loss on its multilingual, code-switched BIO-tagged test set. Its XLM-RoBERTa backbone is intended to generalize across English and Hindi tokens with informal structure. While tested only on synthetic data, it lays the foundation for more rigorous real-world deployment within LexGuard’s privacy-preserving stack.
-
- ### Model Architecture and Objective
-
- ZentryPII-278M is based on the `xlm-roberta-base` transformer architecture, a multilingual masked language model pretrained on 100+ languages. It is adapted for the **token classification** objective via a linear classifier head on top of the contextual embeddings and fine-tuned with a BIO tagging scheme to detect and label PII entities such as names, locations, and temporal references.
-
- ---
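The removed architecture note describes a linear token-classification head over `xlm-roberta-base` with the card's BIO label set. A hedged sketch of how such a model is typically instantiated is shown below; the numeric label ordering is an assumption, not taken from the repository:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Label set taken from the card; the id ordering here is an assumption.
labels = ["O", "B-NAME", "B-LOC", "B-TIME"]
id2label = {i: lab for i, lab in enumerate(labels)}
label2id = {lab: i for i, lab in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```
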
-
- ### Compute Infrastructure
-
- #### Hardware
-
- Training was performed on a **Google Colab T4 GPU instance**
- 16 GB system RAM
- GPU Memory: ~15 GB
-
- #### Software
-
- Python 3.10
- Hugging Face Transformers 4.38+
- Datasets 2.x
- PyTorch 2.x
- Accelerate and Tokenizers libraries
- Environment: Google Colab (Free Tier)
-
- ---
-
- ## Model Card Contact
-
- For questions, usage inquiries, or integration support:
-
- **📧 Email:** [email protected]
- **👤 Maintainer:** Sanskar Pandey
- **🏢 Organization:** LexGuard
 