Character Embedding Model
A character-level embedding for ASCII characters, trained on the Oxford English Dictionary.
Model Description
This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained with a contrastive learning approach (sketched after the list below) where:
- Positive pairs: A character and its corresponding word with that character blanked out
- Negative pairs: Different characters that should have dissimilar embeddings
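The exact pair-sampling procedure is not published with this card; the following is a hypothetical sketch of how a (character, blanked-word) positive pair and a mismatched negative pair could be built. The blank marker "_" and the sampling scheme are illustrative assumptions, not the released training code.

import random

def make_pairs(word, charset="abcdefghijklmnopqrstuvwxyz"):
    # Pick one position to blank out; the removed character is the positive target.
    i = random.randrange(len(word))
    positive_char = word[i]
    blanked_word = word[:i] + "_" + word[i + 1:]
    # Any other character serves as a negative target for the same blanked word.
    negative_char = random.choice([c for c in charset if c != positive_char])
    return (positive_char, blanked_word), (negative_char, blanked_word)

print(make_pairs("dictionary"))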
Architecture
- Embedding Dimension: 8
- Hidden Dimension: 64
- Transformer Layers: 2
- Attention Heads: 8
- Vocabulary Size: 257 (256 character codes, 0-255, plus a blank token); see the encoder sketch after this list
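The dimensions above can be assembled into a small PyTorch encoder. The sketch below is an assumption about the wiring (embedding layer, pooling, output projection); only the listed hyperparameters come from this card.

import torch
import torch.nn as nn

# Illustrative module matching the listed hyperparameters; the real model's
# structure (pooling, heads, output projection) is not published here.
class CharEncoder(nn.Module):
    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64, layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=heads, dim_feedforward=hidden_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # project down to 8-d embeddings

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.proj(h.mean(dim=1))  # mean-pool over character positions

ids = torch.randint(0, 257, (4, 12))   # batch of 4 "words", 12 characters each
print(CharEncoder()(ids).shape)        # torch.Size([4, 8])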
Training
The model was trained on word-definition pairs from a dictionary corpus using the techniques below (an illustrative loss sketch follows the list):
- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best model selection based on quality score (positive similarity - negative similarity)
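As a rough illustration of a margin-based contrastive objective and the quality score described above, assuming cosine similarities and an arbitrary margin of 0.5 (the released training code may differ):

import torch
import torch.nn.functional as F

# Illustrative margin-based contrastive loss on cosine similarities;
# the margin value and the mean reduction are assumptions.
def contrastive_loss(char_emb, pos_word_emb, neg_word_emb, margin=0.5):
    pos_sim = F.cosine_similarity(char_emb, pos_word_emb, dim=-1)
    neg_sim = F.cosine_similarity(char_emb, neg_word_emb, dim=-1)
    # Push positive similarity above negative similarity by at least the margin.
    return F.relu(margin - (pos_sim - neg_sim)).mean()

# Quality score used for best-model selection, as described above.
def quality_score(pos_sim, neg_sim):
    return (pos_sim - neg_sim).mean()

c, p, n = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8)
print(contrastive_loss(c, p, n).item())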
Installation
pip install torch numpy huggingface_hub
Usage
Loading the Model
import torch
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
# Example for downloading a single file
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")
# Load the pre-computed character embeddings
char_embeddings = np.load(local_path, allow_pickle=True).item()
# Convert to tensor for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
    char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
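An optional sanity check after loading; it assumes the dictionary keys are single characters, which matches the lookup code in the next section.

# Inspect how many entries were loaded and the shape of one embedding.
print(f"Loaded {len(char_embedding_tensor)} character embeddings")
some_key = next(iter(char_embedding_tensor))
print(f"Example key: {some_key!r}, shape: {char_embedding_tensor[some_key].shape}")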
Inference: Character to Embedding
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    else:
        print(f"Warning: Character '{char}' not found in embeddings")
        return None
# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}") # Should be (8,)
Inference: Embedding to Character (Decoding)
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    # Normalize query embedding
    query_embedding = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # Normalize character embedding
        emb_norm = F.normalize(emb.unsqueeze(0), p=2, dim=-1)
        # Compute cosine similarity
        sim = F.cosine_similarity(query_embedding, emb_norm, dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]
# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)
if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
Model File
char_embeddings_best.npy: Pre-computed character embeddings, stored as a pickled dictionary in NumPy format (load with allow_pickle=True)
Limitations
- The model only supports 256 single-byte character codes (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency
Citation
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}