Word Embedding Model

Overview

This is a custom word embedding model trained on a Wikipedia dataset. The model learns vector representations of words using a neural network-based embedding layer.

Model Details

  • License: Apache 2.0
  • Library: PyTorch
  • Model Type: Word Embedding
  • Base Model: None
  • Vocabulary Size: 10,000 words
  • Embedding Dimension: 300
  • Framework: PyTorch
  • Optimizer: Adam
  • Loss Function: Cross-entropy loss (when trained with context prediction); see the architecture sketch after this list
  • Python Version: 3.11.9
  • PyTorch Version: 2.6 (with CUDA 12.6 support)
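
For reference, a minimal sketch of an architecture matching these details is shown below: an nn.Embedding lookup of 10,000 × 300 (roughly 3M parameters) plus an optional projection back to the vocabulary for context prediction. This is a hypothetical illustration; the actual class ships as WordEmbeddingModel in embedding_model.py and may differ.

import torch
import torch.nn as nn

class WordEmbeddingModelSketch(nn.Module):
    """Hypothetical sketch; the real model is WordEmbeddingModel in embedding_model.py."""

    def __init__(self, vocab_size: int = 10_000, embedding_dim: int = 300):
        super().__init__()
        # 10,000 x 300 lookup table -> ~3M parameters, as listed above
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Projection to vocabulary logits, used only when training with
        # cross-entropy over context prediction
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, word_idx: torch.Tensor) -> torch.Tensor:
        # Inference path: return the learned embedding vector(s)
        return self.embedding(word_idx)

    def predict_context(self, word_idx: torch.Tensor) -> torch.Tensor:
        # Training path: logits over the vocabulary
        return self.output(self.embedding(word_idx))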

Training Details

The model was trained on a subset of 5,000 samples from the Wikipedia dataset using the following hardware specifications (a sketch of the training step follows the list):

  • GPU: NVIDIA GeForce RTX 4060 8GB OC Edition
  • RAM: DDR5 Dual Channel 32GB (5200MHz)
  • Processor: Intel i5-1200F
  • Training Time: 2 hours
  • Model Parameters: 3M Parameters
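
The training script itself is not included in this card. Below is a minimal, hypothetical sketch of a single training step consistent with the details above (Adam optimizer, cross-entropy over context-word prediction), assuming the dataset yields (center word, context word) index pairs and a model that exposes vocabulary logits, such as the predict_context method in the sketch above.

import torch
import torch.nn as nn

def train_step(model, optimizer, center_idx, context_idx):
    # center_idx, context_idx: LongTensors of shape (batch,)
    optimizer.zero_grad()
    logits = model.predict_context(center_idx)        # (batch, vocab_size)
    loss = nn.functional.cross_entropy(logits, context_idx)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative wiring (names are assumptions, not from the repo):
# model = WordEmbeddingModelSketch()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)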

Downloading the Model

You can download the model files manually, or set up a local copy with the following commands:

## Clone the Hugging Face repo
git clone https://huggingface.co/jaweed123/small-embedding-model-10K-vocab && cd small-embedding-model-10K-vocab/

## Create a Python virtual environment
python -m venv ./env

## Activate the virtual environment
# (Git Bash on Windows; use ./env/bin/activate on Linux/macOS)
source ./env/Scripts/activate

## Install dependencies
pip install -r requirements.txt

Usage

Load and Test the Model

import torch
from embedding_model import WordEmbeddingModel
from dataset_loader import WikipediaDataset

# Load dataset to get the vocabulary
dataset = WikipediaDataset()
vocab = dataset.build_vocab()
word_to_index = vocab
index_to_word = {idx: word for word, idx in vocab.items()}  # Reverse mapping

# Load trained model
VOCAB_SIZE = len(vocab)
model = WordEmbeddingModel(vocab_size=VOCAB_SIZE)

# Load trained weights (handling possible key mismatches)
MODEL_FILE = "word_embeddings_final.pth"
state_dict = torch.load(MODEL_FILE, map_location="cpu")

# Fix key mismatch if needed
if "embeddings.weight" in state_dict:
    state_dict["embedding.weight"] = state_dict.pop("embeddings.weight")

# Load the state dictionary
model.load_state_dict(state_dict, strict=False)
model.eval()  # Set model to evaluation mode

# Function to get word embedding
def get_embedding(word):
    if word not in word_to_index:
        print(f"โŒ Word '{word}' not in vocabulary!")
        return None

    word_idx = torch.tensor([word_to_index[word]], dtype=torch.long)
    with torch.no_grad():
        embedding = model(word_idx)
    
    return embedding.squeeze(0).numpy()  # 1-D vector of length 300

# Test words
test_words = ["king", "queen", "apple", "unknownword"]
for word in test_words:
    embedding = get_embedding(word)
    if embedding is not None:
        print(f"๐Ÿ”น Word: {word} โ†’ Embedding: {embedding[:5]} ...")  # Show first 5 values

Applications

This model is designed for NLP applications requiring word embeddings, such as:

  • Text similarity
  • Similarity search (nearest-neighbor retrieval; see the sketch below)
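
As an illustration of similarity search, the sketch below (not part of the repo) retrieves the nearest vocabulary neighbors of a query word by brute-force cosine similarity over all 10,000 embeddings, reusing get_embedding and index_to_word from the usage snippet above.

import numpy as np

def nearest_neighbors(word, k=5):
    query = get_embedding(word)
    if query is None:
        return []
    query = np.asarray(query).flatten()
    scores = []
    for idx, candidate in index_to_word.items():
        if candidate == word:
            continue
        emb = np.asarray(get_embedding(candidate)).flatten()
        sim = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb) + 1e-8)
        scores.append((candidate, float(sim)))
    # Highest cosine similarity first
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

print(nearest_neighbors("king"))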

Limitations & Bias

  • The model is trained on Wikipedia data and may inherit biases present in the corpus.
  • Limited vocabulary size may lead to out-of-vocabulary (OOV) words.

Citation

If you use this model in research, please cite:

@misc{small-embedding-model-10K-vocab,
  title={Custom Word Embedding Model},
  author={Abdul Jaweed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/jaweed123/small-embedding-model-10K-vocab}
}