Word Embedding Model
Overview
This is a custom word embedding model trained on a Wikipedia dataset. The model learns vector representations of words using a neural network-based embedding layer.
Model Details
- License: Apache 2.0
- Library: PyTorch
- Model Type: Word Embedding
- Base Model: None
- Vocabulary Size: 10,000 words
- Embedding Dimension: 300
- Framework: PyTorch
- Optimizer: Adam
- Loss Function: Cross-entropy loss (if trained with context prediction; see the architecture sketch after this list)
- Python Version: 3.11.9
- PyTorch Version: 2.6 (with CUDA support)
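The repository does not spell out the layer layout, so the snippet below is only a plausible reconstruction from the numbers above (10,000-word vocabulary, 300-dimensional embeddings, cross-entropy over context prediction). The class name matches the one imported in the usage example, but this sketch is an assumption, not the shipped implementation.
import torch.nn as nn

class WordEmbeddingModel(nn.Module):
    # Hypothetical reconstruction: a single 10,000 x 300 embedding table
    # (~3M parameters, matching the figure under Training Details).
    def __init__(self, vocab_size=10_000, embedding_dim=300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, word_idx):
        # Returns the embedding vector(s) for the given word indices
        return self.embedding(word_idx)

    def context_logits(self, word_idx):
        # Scores every vocabulary word as a context candidate by a dot product
        # against the shared embedding table (used with cross-entropy loss)
        return self.forward(word_idx) @ self.embedding.weight.T
Scoring contexts against the shared embedding table keeps the parameter count at roughly vocab_size × embedding_dim = 3M, consistent with the figure reported under Training Details.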
Training Details
The model was trained on a subset of 5,000 samples from the Wikipedia dataset with the following hardware and training setup (a sketch of the training loop follows the list):
- GPU: NVIDIA GeForce RTX 4060 8GB OC Edition
- RAM: DDR5 Dual Channel 32GB (5200MHz)
- Processor: Intel i5-1200F
- Training Time: 2 hours
- Model Parameters: ~3M
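The training script itself is not part of this card. The following is a minimal sketch of what a context-prediction loop with Adam and cross-entropy could look like, reusing the hypothetical WordEmbeddingModel class sketched above; the (center, context) index pairs and batching are assumptions, not the author's code.
import torch
import torch.nn as nn

model = WordEmbeddingModel(vocab_size=10_000, embedding_dim=300)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_epoch(pairs, batch_size=256):
    # `pairs` is assumed to be a list of (center_index, context_index) tuples
    # built from the 5,000 Wikipedia samples.
    model.train()
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        centers = torch.tensor([c for c, _ in batch], dtype=torch.long)
        contexts = torch.tensor([t for _, t in batch], dtype=torch.long)
        logits = model.context_logits(centers)  # (batch, vocab_size)
        loss = loss_fn(logits, contexts)        # cross-entropy over the vocabulary
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()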
Downloading the Model
You can download the model files manually from the repository page or set everything up with the following commands (a huggingface_hub alternative is shown after them):
## Clone the Hugging Face repo
git clone https://huggingface.co/jaweed123/small-embedding-model-10K-vocab && cd small-embedding-model-10K-vocab/
## Create a Python virtual environment
python -m venv ./env
## Activate the virtual environment
# (Git Bash on Windows; use `source ./env/bin/activate` on Linux/macOS)
source ./env/Scripts/activate
## Install dependencies
pip install -r requirements.txt
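If you only need the checkpoint rather than the whole repository, the standard huggingface_hub client can fetch individual files. The filename below is taken from the usage example further down and is assumed to be the checkpoint stored in the repo.
from huggingface_hub import hf_hub_download

# Downloads the checkpoint into the local Hugging Face cache and returns its path
model_path = hf_hub_download(
    repo_id="jaweed123/small-embedding-model-10K-vocab",
    filename="word_embeddings_final.pth",
)
print(model_path)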
Usage
Load and Test the Model
import torch
from embedding_model import WordEmbeddingModel
from dataset_loader import WikipediaDataset

# Load the dataset to rebuild the vocabulary
dataset = WikipediaDataset()
vocab = dataset.build_vocab()
word_to_index = vocab
index_to_word = {idx: word for word, idx in vocab.items()}  # Reverse mapping

# Instantiate the model with the same vocabulary size used in training
VOCAB_SIZE = len(vocab)
model = WordEmbeddingModel(vocab_size=VOCAB_SIZE)

# Load trained weights (handling key mismatches)
MODEL_FILE = "word_embeddings_final.pth"
state_dict = torch.load(MODEL_FILE, map_location="cpu")

# Fix key mismatch if needed
if "embeddings.weight" in state_dict:
    state_dict["embedding.weight"] = state_dict.pop("embeddings.weight")

# Load the state dictionary
model.load_state_dict(state_dict, strict=False)
model.eval()  # Set model to evaluation mode

# Function to get a word embedding
def get_embedding(word):
    if word not in word_to_index:
        print(f"Word '{word}' not in vocabulary!")
        return None
    word_idx = torch.tensor([word_to_index[word]], dtype=torch.long)
    with torch.no_grad():
        embedding = model(word_idx)
    return embedding.squeeze(0).numpy()

# Test words
test_words = ["king", "queen", "apple", "unknownword"]
for word in test_words:
    embedding = get_embedding(word)
    if embedding is not None:
        print(f"Word: {word} -> Embedding: {embedding[:5]} ...")  # Show first 5 values
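Once embeddings are available, a quick way to sanity-check them is to compare two words with cosine similarity. This is a minimal sketch that reuses the get_embedding helper above; the word pair is arbitrary.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = get_embedding("king")
queen = get_embedding("queen")
if king is not None and queen is not None:
    print(f"cosine(king, queen) = {cosine_similarity(king, queen):.4f}")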
Applications
This model is designed for NLP applications requiring word embeddings, such as:
- Text similarity scoring
- Similarity search over the vocabulary (see the nearest-neighbour sketch below)
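As a concrete example of similarity search, the sketch below ranks every vocabulary word against a query word by cosine similarity. It assumes the lookup table is exposed as model.embedding.weight, consistent with the key-mismatch fix in the loading code above; that attribute name is an assumption.
import torch
import torch.nn.functional as F

def most_similar(word, top_k=5):
    # Nearest neighbours by cosine similarity over the full embedding table.
    # Assumes the lookup table is exposed as `model.embedding.weight`.
    if word not in word_to_index:
        return []
    weights = model.embedding.weight.detach()          # (vocab_size, 300)
    query = weights[word_to_index[word]].unsqueeze(0)  # (1, 300)
    sims = F.cosine_similarity(query, weights, dim=1)  # (vocab_size,)
    best = torch.topk(sims, top_k + 1).indices.tolist()
    return [(index_to_word[i], float(sims[i])) for i in best if index_to_word[i] != word][:top_k]

print(most_similar("king"))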
Limitations & Bias
- The model is trained on Wikipedia data and may inherit biases present in the corpus.
- Limited vocabulary size may lead to out-of-vocabulary (OOV) words.
Citation
If you use this model in research, please cite:
@misc{small-embedding-model-10K-vocab,
  title={Custom Word Embedding Model},
  author={Abdul Jaweed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/jaweed123/small-embedding-model-10K-vocab}
}