---
license: apache-2.0
language:
- ur
library_name: fasttext
tags:
- urdu
- word-vectors
- embeddings
- fasttext
- unsupervised
- urdu-nlp
# datasets:
# - [Specify your dataset name here, e.g., your-dataset-name-on-hf, or 'Custom Corpus']
---
# Urdu Word Embeddings (fastText)
## Model Description
This is an unsupervised word embedding model for the Urdu language, trained using the fastText library. It generates high-dimensional vectors for Urdu words, capturing semantic and syntactic relationships based on their context in the training data.
Unlike traditional Word2Vec, this fastText model was trained with character n-grams (`minn=[Your minn]`, `maxn=[Your maxn]`), which is particularly beneficial for morphologically rich languages like Urdu. This allows the model to:
- Learn representations for subword units.
- Generate meaningful vectors for words it hasn't seen during training (Out-of-Vocabulary or OOV words) by composing vectors from their character n-grams.
The model outputs vectors of dimension `[Your vector_size]`.
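Because vectors are composed from character n-grams, even a word absent from the training vocabulary still receives a meaningful vector. The snippet below is a minimal sketch (it assumes the `urdu_fasttext.bin` file from this repository has been downloaded locally; the second word is just an illustrative inflected form):

```python
import fasttext

model = fasttext.load_model("urdu_fasttext.bin")  # assumes a local download

# Inspect the character n-grams the model decomposes a word into
subwords, subword_ids = model.get_subwords("پاکستان")
print(subwords[:5])

# Even a rare or unseen form still gets a vector, composed from the
# n-gram vectors it shares with words seen during training
vector = model.get_word_vector("پاکستانیوں")
print(vector.shape)  # (dim,)
```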
## Intended Use
This model is intended for use in various Urdu Natural Language Processing (NLP) tasks, including:
- Measuring semantic similarity between Urdu words.
- Using word vectors as features for downstream tasks such as text classification, clustering, or named entity recognition (see the sketch after this list).
- Exploring word relationships and patterns within the vocabulary learned from the training corpus.
- Obtaining vector representations for potentially unseen words based on their subword components.
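As a concrete example of the feature-extraction use case above, the sketch below turns each text into one fixed-size vector with `get_sentence_vector` (the sample texts are illustrative, and any downstream classifier is an assumption, not part of this repository):

```python
import fasttext
import numpy as np

model = fasttext.load_model("urdu_fasttext.bin")  # assumes a local download

texts = ["یہ ایک مثال ہے", "اردو ایک خوبصورت زبان ہے"]  # illustrative Urdu texts
# One fixed-size vector per text, averaged from its word vectors
features = np.array([model.get_sentence_vector(t) for t in texts])
print(features.shape)  # (2, dim), ready for scikit-learn classifiers, clustering, etc.
```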
## Training Data
This model was trained on a custom text corpus of Urdu sentences.
- **Dataset Source:** [Specify the source of your training data here. For example: "Collected from the COUNTER (COrpus of Urdu News TExt Reuse) dataset" or "A custom corpus gathered from [mention sources or domain]"].
- **Data Format:** The training data was processed into a single text file (`train.txt`) where each line represented a sentence or document, and words were separated by spaces.
- **Preprocessing:** Basic preprocessing was applied, including replacing common punctuation marks with spaces and normalizing whitespace. [Mention any other specific preprocessing steps you performed, e.g., lowercasing (less common for Urdu), handling numbers, removing specific symbols.] A minimal sketch of this preprocessing appears below.
[If your training data is publicly available or derived from a public source, provide a link or instructions on how others can access it.]
[If the data is private, state that the data itself cannot be shared but the resulting model is being released.]
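A minimal sketch of that preprocessing, assuming a raw corpus file named `raw_corpus.txt` (a hypothetical name) with one document per line; the punctuation set shown is illustrative and should be adapted to your data:

```python
import re

PUNCTUATION = "،۔؛؟!\"'()[]{}:;.,"  # illustrative set; extend as needed

def preprocess(line: str) -> str:
    # Replace common punctuation marks with spaces
    for ch in PUNCTUATION:
        line = line.replace(ch, " ")
    # Normalize runs of whitespace to single spaces
    return re.sub(r"\s+", " ", line).strip()

with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("train.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = preprocess(line)
        if cleaned:
            dst.write(cleaned + "\n")
```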
## Training Procedure
The model was trained using the unsupervised capabilities of the fastText library.
- **Algorithm:** Continuous Bag of Words (CBOW) model (`model=cbow`). [If you used `skipgram`, specify that instead and briefly explain why, e.g., "Skip-gram model (`model=skipgram`), often better for capturing representations of rare words."]
- **Parameters:** The following parameters were used during training (see the invocation sketch at the end of this section):
- `dim`: `[Your vector_size]` (Vector dimensionality)
- `ws`: `[Your window_size]` (Context window size)
- `minCount`: `[Your min_word_count]` (Minimum word frequency to be included in vocabulary)
- `epoch`: `[Your epochs]` (Number of training epochs)
- `neg`: `[Your negative_samples]` (Number of negative samples)
- `minn`: `[Your minn]` (Minimum character n-gram length)
- `maxn`: `[Your maxn]` (Maximum character n-gram length)
- `thread`: 4 (Number of threads used)
- [List any other significant parameters you modified]
- **Training Environment:** The training was performed in a Google Colab environment.
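Putting the parameters above together, the training call looks roughly like the sketch below. The numeric values are illustrative stand-ins for the bracketed placeholders, and the `.vec` export loop is one common way to produce the plain-text vector file shipped alongside the `.bin`:

```python
import fasttext

# Illustrative values; substitute the actual placeholders from the list above
model = fasttext.train_unsupervised(
    "train.txt",      # one preprocessed sentence per line
    model="cbow",     # or "skipgram"
    dim=300,          # [Your vector_size]
    ws=5,             # [Your window_size]
    minCount=5,       # [Your min_word_count]
    epoch=10,         # [Your epochs]
    neg=5,            # [Your negative_samples]
    minn=2,           # [Your minn]
    maxn=5,           # [Your maxn]
    thread=4,
)
model.save_model("urdu_fasttext.bin")

# Export a plain-text .vec file (Word2Vec text format) for other libraries
words = model.get_words()
with open("urdu_fasttext.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(f"{v:.5f}" for v in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")
```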
## How to Use
You can load and use this model using the fastText Python library.
First, make sure you have fastText installed:
```bash
pip install fasttext
```

Then load the model and run the examples:

```python
import fasttext
import numpy as np  # for computing cosine similarity

# Path to the downloaded .bin model file
model_path = "path/to/your/downloaded/urdu_fasttext.bin"

# Load the fastText model
try:
    model = fasttext.load_model(model_path)
    print("Model loaded successfully!")
except ValueError as e:
    print(f"Error loading model: {e}")
    print("Ensure the file exists and is a valid fastText binary model.")
    model = None  # skip the examples below if loading fails

if model:
    # --- Get a word vector ---
    word = "پاکستان"  # example Urdu word
    vector = model.get_word_vector(word)
    print(f"\nVector for '{word}':")
    print(f"Shape: {vector.shape}")
    print(f"First 10 dimensions: {vector[:10]}")

    # --- Find nearest neighbors (similar words) ---
    word_for_neighbors = "اردو"  # example Urdu word
    print(f"\nWords similar to '{word_for_neighbors}':")
    # Top 10 most similar words, returned as (score, word) pairs
    neighbors = model.get_nearest_neighbors(word_for_neighbors, k=10)
    if neighbors:
        print(neighbors)
    else:
        print(f"No similar words found for '{word_for_neighbors}'.")

    # --- Similarity between two words (manual cosine similarity) ---
    word1 = "علم"    # example word 1
    word2 = "روشنی"  # example word 2
    vec1 = model.get_word_vector(word1)
    vec2 = model.get_word_vector(word2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    print(f"\nSimilarity between '{word1}' and '{word2}':")
    if norm1 > 0 and norm2 > 0:
        cosine_similarity = np.dot(vec1, vec2) / (norm1 * norm2)
        print(f"Cosine similarity: {cosine_similarity}")
    else:
        print("Cannot compute similarity: zero vector detected for one or both words.")
else:
    print("\nModel could not be loaded. Usage examples are skipped.")
```

### Using the `.vec` file (optional)

The `.vec` file contains plain-text vectors for the words in the vocabulary only. It can be loaded by other libraries such as Gensim or spaCy, but note that this route *does not* use fastText's subword capabilities for OOV words; for those, use the `.bin` file.

```python
# Requires Gensim: pip install gensim
from gensim.models import KeyedVectors

vec_file_path = "path/to/your/downloaded/urdu_fasttext.vec"
try:
    # Load vectors in Word2Vec text format
    word_vectors = KeyedVectors.load_word2vec_format(vec_file_path, binary=False)
    print(f"Loaded {len(word_vectors.key_to_index)} vectors from the .vec file using Gensim.")
    # Example: find similar words with Gensim
    # print(word_vectors.most_similar("اردو"))
except Exception as e:
    print(f"Error loading .vec file with Gensim: {e}")
```
## Steps After Creating the Model Card Content
1. **Create a Model Repository on Hugging Face:** Go to huggingface.co, log in, click your profile picture -> "New model".
2. **Name your Model:** Choose a descriptive name (e.g., `urdu-fasttext-word-embeddings`).
3. **Set Visibility:** Choose Public or Private.
4. **Create Model:** This creates an empty repository.
5. **Upload Files:** Go to the "Files" tab of your new repository. You can either:
   * Click "Add file" and upload `urdu_fasttext.bin`, `urdu_fasttext.vec`, and your training script file (a scripted alternative is sketched after these steps).
   * Or clone the repository locally and push the files using Git.
6. **Edit Model Card:** Go to the "Model card" tab. This is where you paste and format the content prepared above. You can edit it directly in the browser using Markdown.
7. **Fill in Placeholders:** Go through the content and replace all `[ ... ]` placeholders with your specific details (vector size, epochs, dataset source, license, your name, etc.).
8. **Format with Markdown:** Use the formatting options (headers, bold, code blocks) to make the card readable.
9. **Save Model Card:** Save the changes.
Your model will then be available on Hugging Face with the documentation you've provided.
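If you prefer the scripted route in step 5, here is a minimal sketch using the `huggingface_hub` Python library (the repo id and the training-script filename are placeholders; run `huggingface-cli login` first or pass a token):

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/urdu-fasttext-word-embeddings"  # placeholder

for filename in ["urdu_fasttext.bin", "urdu_fasttext.vec", "train_fasttext.py"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id=repo_id,
        repo_type="model",
    )
```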