---
license: apache-2.0
language:
- ur
library_name: fasttext
tags:
- urdu
- word-vectors
- embeddings
- fasttext
- unsupervised
- urdu-nlp
---
# Urdu Word Embeddings (fastText)

## Model Description

This is an unsupervised word embedding model for the Urdu language, trained with the fastText library. It maps Urdu words to dense vectors that capture semantic and syntactic relationships learned from their contexts in the training data.

Unlike a plain Word2Vec model, this fastText model was trained with character n-grams (`minn=[Your minn]`, `maxn=[Your maxn]`), which is particularly beneficial for a morphologically rich language like Urdu. This allows the model to:

- Learn representations for subword units.
- Generate meaningful vectors for words it has not seen during training (out-of-vocabulary, or OOV, words) by composing vectors from their character n-grams.

The model outputs vectors of dimension `[Your vector_size]`.
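To make the subword behavior concrete, here is a minimal sketch (illustrative only; the model file name is a placeholder) that inspects the character n-grams fastText extracts for a word and requests a vector for a form that may never have appeared in the training corpus:

```python
import fasttext

model = fasttext.load_model("urdu_fasttext.bin")  # placeholder path

# Character n-grams (between minn and maxn in length) backing this word
subwords, indices = model.get_subwords("پاکستان")
print(subwords[:10])

# Even an unseen inflected form receives a vector composed from its n-grams
print(model.get_word_vector("پاکستانیوں")[:5])
```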

## Intended Use

This model is intended for Urdu Natural Language Processing (NLP) tasks such as:

- Measuring semantic similarity between Urdu words.
- Using word vectors as features for downstream tasks such as text classification, clustering, or named entity recognition (a sketch follows this list).
- Exploring word relationships and patterns within the vocabulary learned from the training corpus.
- Obtaining vector representations for potentially unseen words based on their subword components.
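As one concrete example of the feature-extraction use case, the sketch below mean-pools word vectors into a sentence representation; the model path, whitespace tokenization, and example text are illustrative assumptions:

```python
import fasttext
import numpy as np

model = fasttext.load_model("urdu_fasttext.bin")  # placeholder path

def sentence_vector(sentence: str) -> np.ndarray:
    """Mean-pool word vectors; whitespace tokenization is a simplification."""
    words = sentence.split()
    if not words:
        return np.zeros(model.get_dimension())
    return np.mean([model.get_word_vector(w) for w in words], axis=0)

features = sentence_vector("اردو ایک خوبصورت زبان ہے")  # feed to a classifier
```

fastText also provides a built-in `get_sentence_vector` method that performs similar pooling with per-word normalization, which may be preferable in practice.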

## Training Data

This model was trained on a custom corpus of Urdu sentences.

- **Dataset Source:** [Specify the source of your training data here. For example: "Collected from the COUNTER (COrpus of Urdu News TExt Reuse) dataset" or "A custom corpus gathered from [mention sources or domain]".]
- **Data Format:** The training data was processed into a single text file (`train.txt`) with one sentence or document per line and words separated by spaces.
- **Preprocessing:** Basic preprocessing was applied: common punctuation marks were replaced with spaces and whitespace was normalized. [Mention any other specific preprocessing steps you performed, e.g., lowercasing (less common for Urdu), handling numbers, removing specific symbols.]
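A minimal sketch of that preprocessing, assuming a simple punctuation set (the exact characters handled in the original pipeline may differ):

```python
import re

def preprocess(line: str) -> str:
    # Replace common Urdu and Latin punctuation with spaces
    line = re.sub(r"[۔،؛؟.,;:!?'\"()\[\]{}]", " ", line)
    # Normalize whitespace: collapse runs into single spaces
    return re.sub(r"\s+", " ", line).strip()

# One preprocessed sentence per line, as fastText expects
with open("train.txt", "w", encoding="utf-8") as f:
    for sentence in ["یہ ایک مثال ہے۔"]:  # replace with the real corpus
        f.write(preprocess(sentence) + "\n")
```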

[If your training data is publicly available or derived from a public source, provide a link or instructions on how others can access it.]
[If the data is private, state that the data itself cannot be shared but that the resulting model is being released.]

## Training Procedure

The model was trained with the unsupervised mode of the fastText library.

- **Algorithm:** Continuous Bag of Words (`model=cbow`). [If you used `skipgram`, specify that instead and briefly explain why, e.g., "Skip-gram (`model=skipgram`), often better at representing rare words."]
- **Parameters:** The following hyperparameters were used (a sketch of the corresponding training call follows this list):
  - `dim`: `[Your vector_size]` (vector dimensionality)
  - `ws`: `[Your window_size]` (context window size)
  - `minCount`: `[Your min_word_count]` (minimum word frequency for inclusion in the vocabulary)
  - `epoch`: `[Your epochs]` (number of training epochs)
  - `neg`: `[Your negative_samples]` (number of negative samples)
  - `minn`: `[Your minn]` (minimum character n-gram length)
  - `maxn`: `[Your maxn]` (maximum character n-gram length)
  - `thread`: 4 (number of threads)
  - [List any other significant parameters you modified]
- **Training Environment:** Training was performed in a Google Colab environment.
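Putting these settings together, the training call looks roughly like the sketch below; the hyperparameter values shown are placeholders, not the values actually used:

```python
import fasttext

# Placeholder hyperparameters: substitute the values documented above
model = fasttext.train_unsupervised(
    "train.txt",
    model="cbow",  # or "skipgram"
    dim=300,
    ws=5,
    minCount=5,
    epoch=10,
    neg=5,
    minn=2,
    maxn=5,
    thread=4,
)
model.save_model("urdu_fasttext.bin")
```

The accompanying `.vec` text file can then be produced by iterating over `model.words` and writing out each vector returned by `model.get_word_vector`.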

## How to Use

You can load and use this model with the fastText Python library.

First, make sure fastText is installed:

```bash
pip install fasttext
```

Then load the model and query it in Python:

```python
import fasttext
import numpy as np  # used below for cosine similarity

# Path to the downloaded .bin model file
model_path = "path/to/your/downloaded/urdu_fasttext.bin"

# Load the fastText model
try:
    model = fasttext.load_model(model_path)
    print("Model loaded successfully!")
except ValueError as e:
    print(f"Error loading model: {e}")
    print("Ensure the file exists and is a valid fastText binary model.")
    model = None  # skip the examples below if loading fails

if model:
    # --- Get Word Vector ---
    # fastText returns a vector for any string; OOV words are composed
    # from their character n-grams.
    word = "پاکستان"  # example Urdu word
    vector = model.get_word_vector(word)
    print(f"\nVector for '{word}':")
    print(f"Shape: {vector.shape}")
    print(f"First 10 dimensions: {vector[:10]}")

    # --- Find Nearest Neighbors (Similar Words) ---
    word_for_neighbors = "اردو"  # example Urdu word
    print(f"\nWords similar to '{word_for_neighbors}':")
    # Returns the top-k (score, word) pairs from the training vocabulary
    neighbors = model.get_nearest_neighbors(word_for_neighbors, k=10)
    if neighbors:
        print(neighbors)
    else:
        print(f"No similar words found for '{word_for_neighbors}'.")

    # --- Calculate Similarity Between Two Words (Manual Cosine Similarity) ---
    word1 = "علم"  # example word 1
    word2 = "روشنی"  # example word 2
    vec1 = model.get_word_vector(word1)
    vec2 = model.get_word_vector(word2)

    # Cosine similarity = dot product divided by the product of the norms
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    print(f"\nSimilarity between '{word1}' and '{word2}':")
    if norm1 > 0 and norm2 > 0:
        cosine_similarity = np.dot(vec1, vec2) / (norm1 * norm2)
        print(f"Cosine similarity: {cosine_similarity}")
    else:
        print("Cannot compute similarity: zero vector for one or both words.")

    # --- Using the .vec file (Optional) ---
    # The .vec file contains only the vectors for in-vocabulary words and can
    # be loaded by other libraries such as Gensim or spaCy. Note that this
    # route does *not* use fastText's subword handling for OOV words; for
    # fastText-specific features, use the .bin file.
    #
    # Example (requires `pip install gensim`):
    #
    # from gensim.models import KeyedVectors
    # vec_file_path = "path/to/your/downloaded/urdu_fasttext.vec"
    # try:
    #     # Load vectors in the Word2Vec text format
    #     word_vectors = KeyedVectors.load_word2vec_format(vec_file_path, binary=False)
    #     print(f"\nLoaded {len(word_vectors.key_to_index)} vectors with Gensim.")
    #     # print(word_vectors.most_similar("اردو"))
    # except Exception as e:
    #     print(f"Error loading .vec file with Gensim: {e}")
else:
    print("\nModel could not be loaded. Usage examples are skipped.")
```

**Steps after creating the Model Card content:**

1. **Create a Model Repository on Hugging Face:** Go to huggingface.co, log in, and click your profile picture -> "New model".
2. **Name your Model:** Choose a descriptive name (e.g., `urdu-fasttext-word-embeddings`).
3. **Set Visibility:** Choose Public or Private.
4. **Create Model:** This creates an empty repository.
5. **Upload Files:** Go to the "Files" tab of your new repository. You can either:
   * Click "Add file" and upload `urdu_fasttext.bin`, `urdu_fasttext.vec`, and your training script, or
   * Clone the repository locally and push the files with Git (a scripted alternative is sketched after this list).
6. **Edit Model Card:** Go to the "Model card" tab, then paste and format the content prepared above. You can edit it directly in the browser using Markdown.
7. **Fill in Placeholders:** Replace all `[ ... ]` placeholders with your specific details (vector size, epochs, dataset source, license, your name, etc.).
8. **Format with Markdown:** Use headers, bold text, and code blocks to keep the card readable.
9. **Save Model Card:** Save the changes.
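If you prefer a scripted upload over the browser or plain Git, the `huggingface_hub` Python library (not used elsewhere in this card; `pip install huggingface_hub`) offers an equivalent route. The repository id and file names below are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()  # authenticate first with `huggingface-cli login`, or pass token=...
for filename in ["urdu_fasttext.bin", "urdu_fasttext.vec"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="your-username/urdu-fasttext-word-embeddings",
    )
```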

Your model will then be available on Hugging Face with the documentation you've provided.