Character Embedding Model
A character-level embedding for ASCII characters, trained on the Oxford English Dictionary.
Model Description
This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained with a contrastive learning approach (sketched after the list below) where:
- Positive pairs: A character and its corresponding word with that character blanked out
- Negative pairs: Different characters that should have dissimilar embeddings
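The exact pair-sampling procedure is not published with this card; the following is a hypothetical sketch of how a (character, blanked-word) positive pair and a mismatched negative pair could be built. The blank marker "_" and the sampling scheme are illustrative assumptions, not the released training code.

import random

def make_pairs(word, charset="abcdefghijklmnopqrstuvwxyz"):
    # Pick one position to blank out; the removed character is the positive target.
    i = random.randrange(len(word))
    positive_char = word[i]
    blanked_word = word[:i] + "_" + word[i + 1:]
    # Any other character serves as a negative target for the same blanked word.
    negative_char = random.choice([c for c in charset if c != positive_char])
    return (positive_char, blanked_word), (negative_char, blanked_word)

print(make_pairs("dictionary"))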
Architecture
- Embedding Dimension: 8
- Hidden Dimension: 64
- Transformer Layers: 2
- Attention Heads: 8
- Vocabulary Size: 257 (256 character codes, 0-255, plus a blank token); see the encoder sketch after this list
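The dimensions above can be assembled into a small PyTorch encoder. The sketch below is an assumption about the wiring (embedding layer, pooling, output projection); only the listed hyperparameters come from this card.

import torch
import torch.nn as nn

# Illustrative module matching the listed hyperparameters; the real model's
# structure (pooling, heads, output projection) is not published here.
class CharEncoder(nn.Module):
    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64, layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=heads, dim_feedforward=hidden_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # project down to 8-d embeddings

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.proj(h.mean(dim=1))  # mean-pool over character positions

ids = torch.randint(0, 257, (4, 12))   # batch of 4 "words", 12 characters each
print(CharEncoder()(ids).shape)        # torch.Size([4, 8])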
Training
The model was trained on word-definition pairs from a dictionary corpus using the techniques below (an illustrative loss sketch follows the list):
- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best model selection based on quality score (positive similarity - negative similarity)
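As a rough illustration of a margin-based contrastive objective and the quality score described above, assuming cosine similarities and an arbitrary margin of 0.5 (the released training code may differ):

import torch
import torch.nn.functional as F

# Illustrative margin-based contrastive loss on cosine similarities;
# the margin value and the mean reduction are assumptions.
def contrastive_loss(char_emb, pos_word_emb, neg_word_emb, margin=0.5):
    pos_sim = F.cosine_similarity(char_emb, pos_word_emb, dim=-1)
    neg_sim = F.cosine_similarity(char_emb, neg_word_emb, dim=-1)
    # Push positive similarity above negative similarity by at least the margin.
    return F.relu(margin - (pos_sim - neg_sim)).mean()

# Quality score used for best-model selection, as described above.
def quality_score(pos_sim, neg_sim):
    return (pos_sim - neg_sim).mean()

c, p, n = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8)
print(contrastive_loss(c, p, n).item())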
Installation
pip install torch numpy huggingface_hub
Usage
Loading the Model
import torch
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
# Example for downloading a single file
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")
# Load the pre-computed character embeddings
char_embeddings = np.load(local_path, allow_pickle=True).item()
# Convert to tensor for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
    char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
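An optional sanity check after loading; it assumes the dictionary keys are single characters, which matches the lookup code in the next section.

# Inspect how many entries were loaded and the shape of one embedding.
print(f"Loaded {len(char_embedding_tensor)} character embeddings")
some_key = next(iter(char_embedding_tensor))
print(f"Example key: {some_key!r}, shape: {char_embedding_tensor[some_key].shape}")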
Inference: Character to Embedding
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    else:
        print(f"Warning: Character '{char}' not found in embeddings")
        return None
# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}") # Should be (8,)
Inference: Embedding to Character (Decoding)
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    # Normalize query embedding
    query_embedding = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # Normalize character embedding
        emb_norm = F.normalize(emb.unsqueeze(0), p=2, dim=-1)
        # Compute cosine similarity
        sim = F.cosine_similarity(query_embedding, emb_norm, dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]
# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)
if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
Model File
char_embeddings_best.npy: Pre-computed character embeddings, stored as a pickled dictionary in NumPy format (load with allow_pickle=True)
Limitations
- The model only supports 256 single-byte character codes (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency
Citation
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}