gpriday committed
Commit 254e58c · verified · 1 Parent(s): f19443a

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,174 @@
README.md ADDED
@@ -0,0 +1,189 @@
---
license: apache-2.0
language:
- en
tags:
- canine
- character-level
- mlm
- domain-names
- pretrained
datasets:
- humbleworth/registered-domains
base_model: google/canine-c
---

# Domain MLM - CANINE Character-Level Model for Domain Names

This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.

## Model Description

This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.

### Key Features

- **Character-level processing**: Works directly with Unicode code points, no tokenization required
- **Domain-specific**: Pre-trained on 255M registered domain names
- **Masked Language Modeling**: Trained to predict masked characters in domain names
- **Efficient**: 132M parameters, suitable for downstream fine-tuning

### Architecture

- **Base Model**: `google/canine-c` (CANINE-C, 132M parameters)
- **Model Type**: CANINE (Character Architecture with No tokenization In Neural Encoders)
- **Hidden Size**: 768
- **Layers**: 12
- **Attention Heads**: 12
- **Max Position Embeddings**: 16,384 (though domains typically use <128)
- **Vocabulary**: Direct Unicode code points (no vocabulary file needed)

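As a quick sanity check on the advertised size, you can count parameters after loading the model as shown under Basic Usage below. This snippet is illustrative, not from the repo; the count also matches the ~528 MB float32 `model.safetensors` file (132M × 4 bytes):

```python
# Illustrative check: count parameters of the loaded CanineModel.
# 132M float32 weights x 4 bytes ~= the 528 MB model.safetensors in this repo.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```
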
### Training Details

- **Training Data**: humbleworth/registered-domains dataset (255M domains)
- **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
- **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%); see the sketch after this list
- **Optimizer**: AdamW with learning rate 1e-5
- **Batch Size**: 256 per device with gradient accumulation (effective batch size: 512)
- **Hardware**: Optimized for NVIDIA A100 40GB
- **Mixed Precision**: BF16 automatic mixed precision
- **Training Framework**: PyTorch with custom training loop

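A minimal sketch of this masking scheme. Span lengths and the label convention are assumptions on our part; the authoritative implementation lives in the `train_mlm` training code:

```python
import random

MASK_CP = 0xE003  # CANINE mask code point (token id 57347 in this repo's tokenizer config)

def mask_domain(domain: str, mask_prob: float = 0.25, span_frac: float = 0.8):
    """Mask ~25% of characters: ~80% via contiguous spans, ~20% as isolated characters."""
    ids = [ord(c) for c in domain]
    n_to_mask = max(1, int(len(ids) * mask_prob))
    n_span = int(n_to_mask * span_frac)
    positions = set()
    # Contiguous-span masking (span length 2-4 is an assumption)
    while len(positions) < n_span:
        span_len = random.randint(2, 4)
        start = random.randrange(max(1, len(ids) - span_len + 1))
        positions.update(range(start, min(start + span_len, len(ids))))
    # Isolated random characters for the remainder
    while len(positions) < n_to_mask:
        positions.add(random.randrange(len(ids)))
    masked = [MASK_CP if i in positions else cp for i, cp in enumerate(ids)]
    labels = [cp if i in positions else -100 for i, cp in enumerate(ids)]
    return masked, labels
```

The `-100` labels follow PyTorch's cross-entropy `ignore_index` convention, so the loss is computed only at masked positions.
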
## Intended Uses & Limitations

### Intended Uses

- Domain name completion and suggestion
- Understanding domain name patterns
- Feature extraction for domain-related tasks
- Fine-tuning for domain classification tasks
- Domain name generation (with additional fine-tuning)

### Limitations

- This is an early checkpoint (epoch 1); later checkpoints may perform better
- Primarily trained on ASCII domain names
- Limited to domains up to 128 characters
- Not suitable for general text understanding tasks
- Performance on internationalized domain names (IDNs) may be limited

## How to Use

### Basic Usage

```python
import torch
from transformers import CanineTokenizer, CanineModel, CanineConfig

# Load tokenizer
tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm')

# Load the config (reused in the MLM example below) and the base CANINE encoder
config = CanineConfig.from_pretrained('humbleworth/domain-mlm')
model = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Encode a domain
domain = "example.com"
inputs = tokenizer(domain, return_tensors="pt")

# Get character-level embeddings
with torch.no_grad():
    outputs = model(**inputs)
    char_embeddings = outputs.last_hidden_state
```

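For the feature-extraction use case, a common way to reduce the per-character embeddings to a single fixed-size domain vector is masked mean pooling. This is a suggestion on our part, not something the repo prescribes:

```python
# Mean-pool character embeddings over real (non-padding) positions.
# Continues from the snippet above: `inputs` and `outputs` are in scope.
mask = inputs["attention_mask"].unsqueeze(-1)                                     # (1, seq_len, 1)
domain_vector = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)
```
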
### For Masked Language Modeling

To use the model for masked character prediction, you'll need to load the custom MLM head:

```python
# Note: You'll need the custom CanineForMaskedLM class from the training code.
# The MLM head weights are stored in training_state.bin.

import sys
sys.path.append('path/to/training/code')
from train_mlm import CanineForMaskedLM

# Load model with MLM head
model = CanineForMaskedLM(config)
model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Load MLM head weights
state_dict = torch.load('training_state.bin', map_location='cpu')
model.mlm_head.load_state_dict(state_dict['mlm_head_state_dict'])

# Predict masked characters
masked_domain = "goo[MASK]le.com"  # [MASK] stands in for the mask code point U+E003
# ... prediction code ...
```

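The elided prediction step might look like the sketch below. It assumes `CanineForMaskedLM` returns per-position logits over Unicode code points through a `logits` attribute; the actual interface is defined in the training code and may differ:

```python
# Hypothetical prediction step -- the `.logits` attribute and its
# (batch, seq_len, num_code_points) shape are assumptions about CanineForMaskedLM.
mask_char = chr(0xE003)  # CANINE mask code point (token id 57347)
masked_domain = f"goo{mask_char}le.com"
inputs = tokenizer(masked_domain, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = masked_domain.index(mask_char) + 1  # +1 for the prepended CLS token
print(chr(logits[0, mask_pos].argmax().item()))  # ideally 'g'
```
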
## Training Data

The model was trained on the [humbleworth/registered-domains](https://huggingface.co/datasets/humbleworth/registered-domains) dataset, which contains:

### Dataset Statistics
- **Total Size**: 255,097,510 unique registered domain names
- **File Size**: 4.1 GB
- **Source**: [Domains Project](https://domainsproject.org/)
- **Character Set**: 100% ASCII (no internationalized domains)
- **Average Length**: 15.9 characters (range: 4-77 characters)

### TLD Distribution
- **Total Unique TLDs**: 1,274
- **Top TLDs**:
  - .com: 139,092,425 (54.5%)
  - .net: 12,240,626 (4.8%)
  - .de: 11,349,715 (4.4%)
  - .org: 10,107,145 (4.0%)
  - .nl: 3,739,084 (1.5%)

### Domain Characteristics
- **Domains with numbers**: 22,570,972 (8.8%)
- **Domains with hyphens**: 29,207,936 (11.4%)
- **Character patterns**: Lowercase letters, numbers, hyphens, and dots only

This dataset gives broad coverage of real-world domain patterns, making it well suited to training character-level models on domain name structure and conventions.

## Evaluation

This is an intermediate checkpoint; full evaluation metrics will be published with the final model release. The model reached reasonable perplexity on the validation set during training, though exact figures are not reported for this checkpoint.

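For reference, perplexity here is the standard masked-LM definition: the exponential of the mean cross-entropy over masked positions. A self-contained helper (ours, not from the training code):

```python
import torch
import torch.nn.functional as F

def mlm_perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """exp(mean cross-entropy) over masked positions; labels are -100 elsewhere."""
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    return float(torch.exp(loss))
```
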
## Technical Specifications

### Model Architecture
- 12 transformer layers
- 768 hidden dimensions
- 12 attention heads
- GELU activation
- Layer normalization
- Dropout: 0.1

### Infrastructure
- Trained on an NVIDIA A100 40GB GPU
- PyTorch 2.0+
- Mixed precision training (BF16)
- Custom training loop implementation

## Citation

If you use this model, please cite:

```bibtex
@misc{domain-mlm-2024,
  title={Domain MLM: Character-Level Language Modeling for Domain Names},
  author={humbleworth},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/humbleworth/domain-mlm}}
}
```

## License

This model is released under the Apache 2.0 license.

## Acknowledgments

- Based on Google's CANINE-c model
- Trained using the humbleworth/registered-domains dataset
- Training code optimized for NVIDIA A100 GPUs
config.json ADDED
@@ -0,0 +1,28 @@
{
  "architectures": [
    "CanineModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 57344,
  "downsampling_rate": 4,
  "eos_token_id": 57345,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "local_transformer_stride": 128,
  "max_position_embeddings": 16384,
  "model_type": "canine",
  "num_attention_heads": 12,
  "num_hash_buckets": 16384,
  "num_hash_functions": 8,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "torch_dtype": "float32",
  "transformers_version": "4.53.1",
  "type_vocab_size": 16,
  "upsampling_kernel_size": 4,
  "use_cache": true
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:12ff2a2f7c7820d20880edd8f7521069c57ba546a67a8269c23b8ec26fdd5225
size 528359880
special_tokens_map.json ADDED
@@ -0,0 +1,44 @@
{
  "bos_token": {
    "content": "\ue000",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "\ue000",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "\ue001",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "\ue003",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "\u0000",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "\ue001",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json ADDED
@@ -0,0 +1,47 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "\u0000",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "57344": {
      "content": "\ue000",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "57345": {
      "content": "\ue001",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "57347": {
      "content": "\ue003",
      "lstrip": true,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "\ue000",
  "clean_up_tokenization_spaces": false,
  "cls_token": "\ue000",
  "eos_token": "\ue001",
  "extra_special_tokens": {},
  "mask_token": "\ue003",
  "model_max_length": 2048,
  "pad_token": "\u0000",
  "sep_token": "\ue001",
  "tokenizer_class": "CanineTokenizer"
}
training_state.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:123b27f1a56b452ba4c9c862773b14246d00538b73e877dd9c6fd6889a3e0fbc
size 1210452589