gpriday committed
Commit 254e58c · verified · 1 Parent(s): f19443a

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,174 @@
README.md ADDED
@@ -0,0 +1,189 @@
---
license: apache-2.0
language:
- en
tags:
- canine
- character-level
- mlm
- domain-names
- pretrained
datasets:
- humbleworth/registered-domains
base_model: google/canine-c
---

# Domain MLM - CANINE Character-Level Model for Domain Names

This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.

## Model Description

This is a checkpoint from epoch 1 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.

### Key Features

- **Character-level processing**: Works directly with Unicode code points, no tokenization required
- **Domain-specific**: Pre-trained on 255M registered domain names
- **Masked Language Modeling**: Trained to predict masked characters in domain names
- **Efficient**: 132M parameters, suitable for downstream fine-tuning

### Architecture

- **Base Model**: `google/canine-c` (CANINE-C, 132M parameters)
- **Model Type**: CANINE (Character Architecture with No tokenization In Neural Encoders)
- **Hidden Size**: 768
- **Layers**: 12
- **Attention Heads**: 12
- **Max Position Embeddings**: 16,384 (though domains typically use <128)
- **Vocabulary**: Direct Unicode code points (no vocabulary file needed)

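As a quick sanity check on the advertised size, you can count parameters after loading the model as shown under Basic Usage below. This snippet is illustrative, not from the repo; the count also matches the ~528 MB float32 `model.safetensors` file (132M × 4 bytes):

```python
# Illustrative check: count parameters of the loaded CanineModel.
# 132M float32 weights x 4 bytes ~= the 528 MB model.safetensors in this repo.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```
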
### Training Details

- **Training Data**: humbleworth/registered-domains dataset (255M domains)
- **Training Objective**: Masked Language Modeling (MLM) with 25% masking probability
- **Masking Strategy**: Mix of contiguous spans (80%) and random characters (20%); see the sketch after this list
- **Optimizer**: AdamW with learning rate 1e-5
- **Batch Size**: 256 per device with gradient accumulation (effective batch size: 512)
- **Hardware**: Optimized for NVIDIA A100 40GB
- **Mixed Precision**: BF16 automatic mixed precision
- **Training Framework**: PyTorch with custom training loop

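A minimal sketch of this masking scheme. Span lengths and the label convention are assumptions on our part; the authoritative implementation lives in the `train_mlm` training code:

```python
import random

MASK_CP = 0xE003  # CANINE mask code point (token id 57347 in this repo's tokenizer config)

def mask_domain(domain: str, mask_prob: float = 0.25, span_frac: float = 0.8):
    """Mask ~25% of characters: ~80% via contiguous spans, ~20% as isolated characters."""
    ids = [ord(c) for c in domain]
    n_to_mask = max(1, int(len(ids) * mask_prob))
    n_span = int(n_to_mask * span_frac)
    positions = set()
    # Contiguous-span masking (span length 2-4 is an assumption)
    while len(positions) < n_span:
        span_len = random.randint(2, 4)
        start = random.randrange(max(1, len(ids) - span_len + 1))
        positions.update(range(start, min(start + span_len, len(ids))))
    # Isolated random characters for the remainder
    while len(positions) < n_to_mask:
        positions.add(random.randrange(len(ids)))
    masked = [MASK_CP if i in positions else cp for i, cp in enumerate(ids)]
    labels = [cp if i in positions else -100 for i, cp in enumerate(ids)]
    return masked, labels
```

The `-100` labels follow PyTorch's cross-entropy `ignore_index` convention, so the loss is computed only at masked positions.
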
## Intended Uses & Limitations

### Intended Uses

- Domain name completion and suggestion
- Understanding domain name patterns
- Feature extraction for domain-related tasks
- Fine-tuning for domain classification tasks
- Domain name generation (with additional fine-tuning)

### Limitations

- This is an early checkpoint (epoch 1); later checkpoints may perform better
- Primarily trained on ASCII domain names
- Limited to domains up to 128 characters
- Not suitable for general text understanding tasks
- Performance on internationalized domain names (IDNs) may be limited

## How to Use

### Basic Usage

```python
import torch
from transformers import CanineTokenizer, CanineModel, CanineConfig

# Load tokenizer
tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm')

# Load the config (reused in the MLM example below) and the base CANINE encoder
config = CanineConfig.from_pretrained('humbleworth/domain-mlm')
model = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Encode a domain
domain = "example.com"
inputs = tokenizer(domain, return_tensors="pt")

# Get character-level embeddings
with torch.no_grad():
    outputs = model(**inputs)
    char_embeddings = outputs.last_hidden_state
```

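For the feature-extraction use case, a common way to reduce the per-character embeddings to a single fixed-size domain vector is masked mean pooling. This is a suggestion on our part, not something the repo prescribes:

```python
# Mean-pool character embeddings over real (non-padding) positions.
# Continues from the snippet above: `inputs` and `outputs` are in scope.
mask = inputs["attention_mask"].unsqueeze(-1)                                     # (1, seq_len, 1)
domain_vector = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)
```
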
### For Masked Language Modeling

To use the model for masked character prediction, you'll need to load the custom MLM head:

```python
# Note: You'll need the custom CanineForMaskedLM class from the training code.
# The MLM head weights are stored in training_state.bin.

import sys
sys.path.append('path/to/training/code')
from train_mlm import CanineForMaskedLM

# Load model with MLM head
model = CanineForMaskedLM(config)
model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Load MLM head weights
state_dict = torch.load('training_state.bin', map_location='cpu')
model.mlm_head.load_state_dict(state_dict['mlm_head_state_dict'])

# Predict masked characters
masked_domain = "goo[MASK]le.com"  # [MASK] stands in for the mask code point U+E003
# ... prediction code ...
```

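The elided prediction step might look like the sketch below. It assumes `CanineForMaskedLM` returns per-position logits over Unicode code points through a `logits` attribute; the actual interface is defined in the training code and may differ:

```python
# Hypothetical prediction step -- the `.logits` attribute and its
# (batch, seq_len, num_code_points) shape are assumptions about CanineForMaskedLM.
mask_char = chr(0xE003)  # CANINE mask code point (token id 57347)
masked_domain = f"goo{mask_char}le.com"
inputs = tokenizer(masked_domain, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = masked_domain.index(mask_char) + 1  # +1 for the prepended CLS token
print(chr(logits[0, mask_pos].argmax().item()))  # ideally 'g'
```
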
## Training Data

The model was trained on the [humbleworth/registered-domains](https://huggingface.co/datasets/humbleworth/registered-domains) dataset, which contains:

### Dataset Statistics
- **Total Size**: 255,097,510 unique registered domain names
- **File Size**: 4.1 GB
- **Source**: [Domains Project](https://domainsproject.org/)
- **Character Set**: 100% ASCII (no internationalized domains)
- **Average Length**: 15.9 characters (range: 4-77 characters)

### TLD Distribution
- **Total Unique TLDs**: 1,274
- **Top TLDs**:
  - .com: 139,092,425 (54.5%)
  - .net: 12,240,626 (4.8%)
  - .de: 11,349,715 (4.4%)
  - .org: 10,107,145 (4.0%)
  - .nl: 3,739,084 (1.5%)

### Domain Characteristics
- **Domains with numbers**: 22,570,972 (8.8%)
- **Domains with hyphens**: 29,207,936 (11.4%)
- **Character patterns**: Lowercase letters, numbers, hyphens, and dots only

This dataset gives broad coverage of real-world domain patterns, making it well suited to training character-level models on domain name structure and conventions.

## Evaluation

This is an intermediate checkpoint; full evaluation metrics will be published with the final model release. The model reached reasonable perplexity on the validation set during training, though exact figures are not reported for this checkpoint.

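For reference, perplexity here is the standard masked-LM definition: the exponential of the mean cross-entropy over masked positions. A self-contained helper (ours, not from the training code):

```python
import torch
import torch.nn.functional as F

def mlm_perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """exp(mean cross-entropy) over masked positions; labels are -100 elsewhere."""
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    return float(torch.exp(loss))
```
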
## Technical Specifications

### Model Architecture
- 12 transformer layers
- 768 hidden dimensions
- 12 attention heads
- GELU activation
- Layer normalization
- Dropout: 0.1

### Infrastructure
- Trained on an NVIDIA A100 40GB GPU
- PyTorch 2.0+
- Mixed precision training (BF16)
- Custom training loop implementation

## Citation

If you use this model, please cite:

```bibtex
@misc{domain-mlm-2024,
  title={Domain MLM: Character-Level Language Modeling for Domain Names},
  author={humbleworth},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/humbleworth/domain-mlm}}
}
```

## License

This model is released under the Apache 2.0 license.

## Acknowledgments

- Based on Google's CANINE-c model
- Trained using the humbleworth/registered-domains dataset
- Training code optimized for NVIDIA A100 GPUs
config.json ADDED
@@ -0,0 +1,28 @@
{
  "architectures": [
    "CanineModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 57344,
  "downsampling_rate": 4,
  "eos_token_id": 57345,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "local_transformer_stride": 128,
  "max_position_embeddings": 16384,
  "model_type": "canine",
  "num_attention_heads": 12,
  "num_hash_buckets": 16384,
  "num_hash_functions": 8,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "torch_dtype": "float32",
  "transformers_version": "4.53.1",
  "type_vocab_size": 16,
  "upsampling_kernel_size": 4,
  "use_cache": true
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:12ff2a2f7c7820d20880edd8f7521069c57ba546a67a8269c23b8ec26fdd5225
size 528359880
special_tokens_map.json ADDED
@@ -0,0 +1,44 @@
{
  "bos_token": {
    "content": "\ue000",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "\ue000",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "\ue001",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "\ue003",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "\u0000",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "\ue001",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json ADDED
@@ -0,0 +1,47 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "\u0000",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "57344": {
      "content": "\ue000",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "57345": {
      "content": "\ue001",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "57347": {
      "content": "\ue003",
      "lstrip": true,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "\ue000",
  "clean_up_tokenization_spaces": false,
  "cls_token": "\ue000",
  "eos_token": "\ue001",
  "extra_special_tokens": {},
  "mask_token": "\ue003",
  "model_max_length": 2048,
  "pad_token": "\u0000",
  "sep_token": "\ue001",
  "tokenizer_class": "CanineTokenizer"
}
training_state.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:123b27f1a56b452ba4c9c862773b14246d00538b73e877dd9c6fd6889a3e0fbc
size 1210452589