Word Embedding Model

Overview

This is a custom word embedding model trained on a Wikipedia dataset. The model learns vector representations of words using a neural network-based embedding layer.

Model Details

  • License: Apache 2.0
  • Library: PyTorch
  • Model Type: Word Embedding
  • Base Model: None
  • Vocabulary Size: 10,000 words
  • Embedding Dimension: 300
  • Framework: PyTorch
  • Optimizer: Adam
  • Loss Function: Cross-entropy loss (when trained with context prediction); see the architecture sketch after this list
  • Python Version: 3.11.9
  • PyTorch Version: 2.6 (with CUDA 12.6 support)
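
For reference, a minimal sketch of an architecture matching these details is shown below: an nn.Embedding lookup of 10,000 × 300 (roughly 3M parameters) plus an optional projection back to the vocabulary for context prediction. This is a hypothetical illustration; the actual class ships as WordEmbeddingModel in embedding_model.py and may differ.

import torch
import torch.nn as nn

class WordEmbeddingModelSketch(nn.Module):
    """Hypothetical sketch; the real model is WordEmbeddingModel in embedding_model.py."""

    def __init__(self, vocab_size: int = 10_000, embedding_dim: int = 300):
        super().__init__()
        # 10,000 x 300 lookup table -> ~3M parameters, as listed above
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Projection to vocabulary logits, used only when training with
        # cross-entropy over context prediction
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, word_idx: torch.Tensor) -> torch.Tensor:
        # Inference path: return the learned embedding vector(s)
        return self.embedding(word_idx)

    def predict_context(self, word_idx: torch.Tensor) -> torch.Tensor:
        # Training path: logits over the vocabulary
        return self.output(self.embedding(word_idx))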

Training Details

The model was trained on a subset of 5,000 samples from the Wikipedia dataset using the following hardware specifications (a sketch of the training step follows the list):

  • GPU: NVIDIA GeForce RTX 4060 8GB OC Edition
  • RAM: DDR5 Dual Channel 32GB (5200MHz)
  • Processor: Intel i5-1200F
  • Training Time: 2 hours
  • Model Parameters: 3M Parameters
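
The training script itself is not included in this card. Below is a minimal, hypothetical sketch of a single training step consistent with the details above (Adam optimizer, cross-entropy over context-word prediction), assuming the dataset yields (center word, context word) index pairs and a model that exposes vocabulary logits, such as the predict_context method in the sketch above.

import torch
import torch.nn as nn

def train_step(model, optimizer, center_idx, context_idx):
    # center_idx, context_idx: LongTensors of shape (batch,)
    optimizer.zero_grad()
    logits = model.predict_context(center_idx)        # (batch, vocab_size)
    loss = nn.functional.cross_entropy(logits, context_idx)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative wiring (names are assumptions, not from the repo):
# model = WordEmbeddingModelSketch()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)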

Downloading the Model

You can download the model files manually, or set up a local copy with the following commands:

## Clone the Hugging Face repo
git clone https://huggingface.co/jaweed123/small-embedding-model-10K-vocab && cd small-embedding-model-10K-vocab/

## Create a Python virtual environment
python -m venv ./env

## Activate the virtual environment
# (Git Bash on Windows; use ./env/bin/activate on Linux/macOS)
source ./env/Scripts/activate

## Install dependencies
pip install -r requirements.txt

Usage

Load and Test the Model

import torch
from embedding_model import WordEmbeddingModel
from dataset_loader import WikipediaDataset

# Load dataset to get the vocabulary
dataset = WikipediaDataset()
vocab = dataset.build_vocab()
word_to_index = vocab
index_to_word = {idx: word for word, idx in vocab.items()}  # Reverse mapping

# Load trained model
VOCAB_SIZE = len(vocab)
model = WordEmbeddingModel(vocab_size=VOCAB_SIZE)

# Load trained weights (handling possible key mismatches)
MODEL_FILE = "word_embeddings_final.pth"
state_dict = torch.load(MODEL_FILE, map_location="cpu")

# Fix key mismatch if needed
if "embeddings.weight" in state_dict:
    state_dict["embedding.weight"] = state_dict.pop("embeddings.weight")

# Load the state dictionary
model.load_state_dict(state_dict, strict=False)
model.eval()  # Set model to evaluation mode

# Function to get word embedding
def get_embedding(word):
    if word not in word_to_index:
        print(f"โŒ Word '{word}' not in vocabulary!")
        return None

    word_idx = torch.tensor([word_to_index[word]], dtype=torch.long)
    with torch.no_grad():
        embedding = model(word_idx)
    
    return embedding.squeeze(0).numpy()  # 1-D vector of length 300

# Test words
test_words = ["king", "queen", "apple", "unknownword"]
for word in test_words:
    embedding = get_embedding(word)
    if embedding is not None:
        print(f"๐Ÿ”น Word: {word} โ†’ Embedding: {embedding[:5]} ...")  # Show first 5 values

Applications

This model is designed for NLP applications requiring word embeddings, such as:

  • Text similarity
  • Similarity search (nearest-neighbor retrieval; see the sketch below)
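
As an illustration of similarity search, the sketch below (not part of the repo) retrieves the nearest vocabulary neighbors of a query word by brute-force cosine similarity over all 10,000 embeddings, reusing get_embedding and index_to_word from the usage snippet above.

import numpy as np

def nearest_neighbors(word, k=5):
    query = get_embedding(word)
    if query is None:
        return []
    query = np.asarray(query).flatten()
    scores = []
    for idx, candidate in index_to_word.items():
        if candidate == word:
            continue
        emb = np.asarray(get_embedding(candidate)).flatten()
        sim = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb) + 1e-8)
        scores.append((candidate, float(sim)))
    # Highest cosine similarity first
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

print(nearest_neighbors("king"))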

Limitations & Bias

  • The model is trained on Wikipedia data and may inherit biases present in the corpus.
  • Limited vocabulary size may lead to out-of-vocabulary (OOV) words.

Citation

If you use this model in research, please cite:

@misc{small-embedding-model-10K-vocab,
  title={Custom Word Embedding Model},
  author={Abdul Jaweed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/jaweed123/small-embedding-model-10K-vocab}
}