---
license: apache-2.0
language:
- en
tags:
- canine
- character-level
- mlm
- domain-names
- pretrained
datasets:
- humbleworth/registered-domains
base_model: google/canine-c
model-index:
- name: domain-mlm-epoch-2
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    dataset:
      name: humbleworth/registered-domains
      type: humbleworth/registered-domains
      split: validation
    metrics:
    - type: perplexity
      value: 3.36
      name: Validation Perplexity
---

# Domain MLM - CANINE Character-Level Model for Domain Names (Epoch 2)

This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.

## Model Description

This is a checkpoint from epoch 2 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.

### Key Features

- **Character-level processing**: Works directly with Unicode code points, no tokenization required
- **Domain-specific**: Pre-trained on 255M registered domain names
- **Masked Language Modeling**: Trained to predict masked characters in domain names (25% masking probability)
- **Efficient**: 132M parameters, suitable for downstream fine-tuning
- **Strong Performance**: Achieved 3.36 validation perplexity

### Architecture

- **Base Model**: `google/canine-c` (132M parameters)
- **Model Type**: CANINE (Character Architecture with No tokenization In Neural Encoders)
- **Hidden Size**: 768
- **Layers**: 12
- **Attention Heads**: 12
- **Max Position Embeddings**: 16,384 (though domains typically use <128)
- **Vocabulary**: Direct Unicode code points (no vocabulary file needed)
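
Because the model consumes raw Unicode code points, "tokenization" is essentially a character-to-integer mapping. A quick illustration in plain Python (the CANINE tokenizer additionally adds its own special code points):

```python
# Each character maps directly to its Unicode code point; no vocabulary lookup.
domain = "example.com"
code_points = [ord(c) for c in domain]
print(code_points)
# [101, 120, 97, 109, 112, 108, 101, 46, 99, 111, 109]
```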

### Training Details

- **Training Data**: humbleworth/registered-domains dataset (255M domains)
- **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
- **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%); see the sketch after this list
- **Optimizer**: AdamW with learning rate 3e-5, weight decay 0.01
- **Batch Size**: 512 per device with gradient accumulation steps of 3 (effective batch size: 1,536)
- **Hardware**: NVIDIA A100 40GB
- **Mixed Precision**: BF16 automatic mixed precision
- **Training Framework**: PyTorch with custom training loop
- **Warmup Steps**: 2,000
- **Total Steps**: ~830,000 (2 epochs completed at 332,200 steps)
- **Training Time**: ~36 hours for 2 epochs
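
The exact masking code lives in the training scripts, but a minimal sketch of the span-plus-random strategy listed above might look like the following (all names are illustrative; `MASK_CODE_POINT` follows the U+E000 convention used later in this card):

```python
import random

MASK_CODE_POINT = 0xE000  # private-use code point standing in for [MASK]

def mask_domain(code_points, mask_prob=0.25, span_frac=0.8):
    """Illustrative sketch: ~80% of masked positions come from one
    contiguous span, the remainder from random single characters."""
    n = len(code_points)
    num_to_mask = max(1, int(n * mask_prob))
    span_len = int(num_to_mask * span_frac)

    positions = set()
    if 0 < span_len < n:
        start = random.randrange(n - span_len)
        positions.update(range(start, start + span_len))
    while len(positions) < num_to_mask:
        positions.add(random.randrange(n))

    masked = list(code_points)
    labels = [-100] * n  # -100 marks positions the loss ignores
    for i in positions:
        labels[i] = masked[i]
        masked[i] = MASK_CODE_POINT
    return masked, labels

masked, labels = mask_domain([ord(c) for c in "example.com"])
```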

### Performance Metrics

**Epoch 2 Results**:
- **Training Loss**: 1.29
- **Training Perplexity**: 3.62
- **Validation Loss**: 1.21
- **Validation Perplexity**: 3.36
- **Best Training Perplexity**: 3.49 (achieved during epoch 2)
- **Processing Speed**: 4,037 samples/second
- **GPU Memory Usage**: 2.85 GB (highly optimized)

The model shows excellent convergence, improving from an initial perplexity of 10.08 to 3.36 on validation data. The validation perplexity of 3.36 indicates the model effectively narrows down character predictions to approximately 3-4 likely candidates on average.
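
For reference, perplexity here is simply the exponential of the per-character cross-entropy loss, which is easy to verify against the reported numbers:

```python
import math

print(math.exp(2.31))  # ≈ 10.07, the initial perplexity
print(math.exp(1.21))  # ≈ 3.35, the final validation perplexity
```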

## Intended Uses & Limitations

### Intended Uses

- Domain name completion and suggestion
- Understanding domain name patterns
- Feature extraction for domain-related tasks
- Fine-tuning for domain classification tasks
- Domain name generation (with additional fine-tuning)
- Character-level anomaly detection in domains

### Limitations

- Primarily trained on ASCII domain names
- Limited to domains up to 64 characters (training max_length)
- Not suitable for general text understanding tasks
- Performance on internationalized domain names (IDN) may be limited
- The model has learned strong biases toward common TLDs (.com, .net, .org)

## How to Use

### Basic Usage

```python
import torch
from transformers import CanineTokenizer, CanineModel

# Load the tokenizer and the domain-adapted CANINE encoder
tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm')
model = CanineModel.from_pretrained('humbleworth/domain-mlm')
model.eval()

# Encode a domain
domain = "example.com"
inputs = tokenizer(domain, return_tensors="pt")

# Get character-level embeddings
with torch.no_grad():
    outputs = model(**inputs)
    char_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, 768)
```
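
Continuing the snippet above, a common (though not prescribed here) way to turn these character embeddings into a single feature vector for downstream tasks is mean pooling:

```python
# Mean-pool character embeddings into one 768-dim vector per domain,
# using the attention mask to ignore padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
domain_vector = (char_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(domain_vector.shape)  # torch.Size([1, 768])
```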

### For Masked Language Modeling

To use the model for masked character prediction, you'll need to load the custom MLM head:

```python
# Note: you'll need the custom CanineForMaskedLM class from the training code.
# The MLM head weights are stored in training_state.bin.

import sys
sys.path.append('path/to/training/code')

import torch
from transformers import CanineConfig, CanineModel
from train_mlm import CanineForMaskedLM

# Build the model with the MLM head and load the pretrained encoder
config = CanineConfig.from_pretrained('humbleworth/domain-mlm')
model = CanineForMaskedLM(config)
model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Load the MLM head weights
state_dict = torch.load('training_state.bin', map_location='cpu')
model.mlm_head.load_state_dict(state_dict['mlm_head_state_dict'])

# Predict masked characters
masked_domain = "goo[MASK]le.com"  # [MASK] will be replaced with U+E000
# ... prediction code ...
```
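
The elided prediction step depends on the custom class's interface. Assuming the MLM head returns per-position logits over code points (an assumption; check `train_mlm.py`), a hypothetical completion of the snippet above could look like:

```python
# Hypothetical sketch: assumes the forward pass returns per-position
# logits over Unicode code points. Verify against the training code.
model.eval()

domain = "google.com"
input_ids = torch.tensor([[ord(c) for c in domain]])
input_ids[0, 3] = 0xE000  # mask the second 'g' with U+E000
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask)

predicted_cp = logits[0, 3].argmax().item()
print(chr(predicted_cp))  # ideally 'g'
```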

## Training Data

The model was trained on the [humbleworth/registered-domains](https://huggingface.co/datasets/humbleworth/registered-domains) dataset, which contains:

### Dataset Statistics
- **Total Size**: 255,097,510 unique registered domain names
- **File Size**: 4.1 GB
- **Source**: [Domains Project](https://domainsproject.org/)
- **Character Set**: 100% ASCII (no internationalized domains)
- **Average Length**: 15.9 characters (range: 4-77 characters)
- **Training/Validation Split**: 99.9% / 0.1%

### TLD Distribution
- **Total Unique TLDs**: 1,274
- **Top TLDs**:
  - .com: 139,092,425 (54.5%)
  - .net: 12,240,626 (4.8%)
  - .de: 11,349,715 (4.4%)
  - .org: 10,107,145 (4.0%)
  - .nl: 3,739,084 (1.5%)

### Domain Characteristics
- **Domains with numbers**: 22,570,972 (8.8%)
- **Domains with hyphens**: 29,207,936 (11.4%)
- **Character patterns**: Lowercase letters, numbers, hyphens, and dots only

This comprehensive dataset provides excellent coverage of real-world domain patterns, making it ideal for training character-level models to understand domain name structures and conventions.
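
The data can be pulled with the `datasets` library; streaming avoids downloading the full 4.1 GB up front. The split name and field layout below are assumptions; inspect the dataset card to confirm its schema:

```python
from datasets import load_dataset

# Stream the 255M-domain dataset rather than downloading it all at once
ds = load_dataset("humbleworth/registered-domains", split="train", streaming=True)
for example in ds:
    print(example)  # field names depend on the dataset's schema
    break
```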

## Evaluation

### Perplexity Analysis

The model achieved a validation perplexity of **3.36**, which means:
- The model effectively chooses between ~3.36 possible characters on average at each position
- This represents excellent performance for domain name modeling
- The low perplexity indicates strong pattern learning, including:
  - TLD patterns (high certainty after dots)
  - Common domain prefixes and suffixes
  - Valid character sequences in domain names

### Training Progression
- **Initial**: Loss=2.31, Perplexity=10.08
- **Epoch 1**: ~4.5-5.0 perplexity (estimated)
- **Epoch 2**: Loss=1.21, Perplexity=3.36
- **Best achieved**: Perplexity=3.49 (training), 3.36 (validation)

The model appears to be approaching asymptotic performance around 3.2-3.5 perplexity, suggesting it has captured most of the learnable structure in the domain dataset.

## Technical Specifications

### Model Architecture
- 12 transformer layers
- 768 hidden dimensions
- 12 attention heads
- GELU activation
- Layer normalization
- Dropout: 0.1

### Infrastructure
- Trained on NVIDIA A100 40GB GPU
- PyTorch 2.0+
- Mixed precision training (BF16)
- Custom training loop implementation
- Gradient clipping: 1.0
- Training tracked with Weights & Biases
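
Putting these settings together, one optimization step in such a loop might look like the following heavily simplified sketch (`dataloader` and a loss-returning forward pass are assumptions about the custom training code):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
accum_steps = 3  # gradient accumulation, for an effective batch of 1,536

optimizer.zero_grad()
for step, batch in enumerate(dataloader):  # dataloader assumed defined elsewhere
    # BF16 automatic mixed precision, as listed above
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch)  # assumes the model returns the MLM loss
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
        optimizer.step()
        optimizer.zero_grad()
```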

## Citation

If you use this model, please cite:

```bibtex
@misc{domain-mlm-2025,
  title={Domain MLM: Character-Level Language Modeling for Domain Names},
  author={humbleworth},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/humbleworth/domain-mlm}}
}
```

## License

This model is released under the Apache 2.0 license.

## Acknowledgments

- Based on Google's CANINE-c model
- Trained using the humbleworth/registered-domains dataset
- Optimized training code for NVIDIA A100 GPUs
- Training infrastructure provided by Lambda Labs