Update README.md
README.md
CHANGED
@@ -1,3 +1,184 @@
---
license: apache-2.0
language:
- ur
base_model:
- facebook/fasttext-km-vectors
tags:
- art
---
---
# Card Metadata (Optional but Recommended)
# You can fill these out directly in the Hugging Face UI or here.
# Language: ur (Urdu)
# Tasks:
# - word-embeddings
# Library:
# - fasttext
# Datasets:
# - [Specify your dataset name here, e.g., your-dataset-name-on-hf, or just 'Custom Corpus']
# Tags:
# - urdu
# - word-vectors
# - embeddings
# - fasttext
# - unsupervised
# - urdu-nlp
# License: [Specify your license here, e.g., mit, apache-2.0, cc-by-4.0]
---

# Urdu Word Embeddings (fastText)

## Model Description

This is an unsupervised word embedding model for the Urdu language, trained with the fastText library. It maps Urdu words to dense vectors that capture semantic and syntactic relationships based on the contexts in which the words appear in the training data.

Unlike the original Word2Vec, this fastText model was trained with character n-grams (`minn=[Your minn]`, `maxn=[Your maxn]`), which is particularly beneficial for morphologically rich languages like Urdu. This allows the model to:
- Learn representations for subword units.
- Generate meaningful vectors for words it has not seen during training (out-of-vocabulary, or OOV, words) by composing vectors from their character n-grams.

The model outputs vectors of dimension `[Your vector_size]`.

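How a word is split into character n-grams can be sketched as follows. This is a minimal illustration, assuming the scheme described for fastText (the word is wrapped in `<` and `>` boundary markers and every substring of length `minn` to `maxn` is taken). The function name `char_ngrams` and the example lengths are hypothetical, and the real library additionally hashes each n-gram into a fixed number of buckets.

```python
def char_ngrams(word: str, minn: int, maxn: int) -> list:
    """Return the character n-grams of `word`, fastText-style (illustrative only)."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Hypothetical lengths; the card leaves minn/maxn as placeholders.
print(char_ngrams("اردو", 3, 4))
```

An unseen word still shares many of these n-grams with words from the training corpus, which is what lets the model compose a vector for it.
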
## Intended Use

This model is intended for use in various Urdu Natural Language Processing (NLP) tasks, including:
- Measuring semantic similarity between Urdu words.
- Using word vectors as features for downstream tasks such as text classification, clustering, or named entity recognition.
- Exploring word relationships and patterns within the vocabulary learned from the training corpus.
- Obtaining vector representations for potentially unseen words based on their subword components.

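One common way to turn word vectors into features for a downstream classifier is to average them over a sentence. The sketch below assumes that approach; `average_vector` and the toy lookup table are hypothetical and stand in for the real model so the example runs without a model file.

```python
def average_vector(tokens, get_vector, dim):
    """Average the vectors of `tokens`; return a zero vector for empty input."""
    if not tokens:
        return [0.0] * dim
    vectors = [get_vector(tok) for tok in tokens]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy stand-in for a model lookup (made-up 2-dimensional vectors).
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(average_vector(["a", "b"], toy_vectors.__getitem__, dim=2))
```

With the actual model, `model.get_word_vector` could be passed as `get_vector`. The fastText Python API also provides `model.get_sentence_vector(text)`, which may be the simpler option.
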
## Training Data

This model was trained on a custom text corpus of Urdu sentences.

- **Dataset Source:** [Specify the source of your training data here. For example: "Collected from the COUNTER (COrpus of Urdu News TExt Reuse) dataset" or "A custom corpus gathered from [mention sources or domain]"].
- **Data Format:** The training data was processed into a single text file (`train.txt`) in which each line is a sentence or document and words are separated by spaces.
- **Preprocessing:** Basic preprocessing was applied: common punctuation marks were replaced with spaces and whitespace was normalized. [Mention any other specific preprocessing steps you performed, e.g., lowercasing (less common for Urdu), handling numbers, removing specific symbols].

[If your training data is publicly available or derived from a public source, provide a link or instructions on how others can access it.]
[If the data is private, state that the data itself cannot be shared but the resulting model is being released.]

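The preprocessing described above (punctuation replaced with spaces, whitespace normalized) could look roughly like this. The exact punctuation set is an assumption, since the card does not list the characters that were handled, and `preprocess_line` is a hypothetical name.

```python
# Assumed punctuation set: Urdu full stop, Arabic comma, semicolon and
# question mark, plus common ASCII marks. Not taken from the original script.
PUNCTUATION = "\u06D4\u060C\u061B\u061F" + "!\"'()[]{}.,;:?-"
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def preprocess_line(line: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(line.translate(_TO_SPACE).split())


print(preprocess_line("hello,  world!"))
```

Each cleaned line would then be written to `train.txt`, one sentence or document per line.
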
## Training Procedure

The model was trained using the unsupervised capabilities of the fastText library.

- **Algorithm:** Continuous Bag of Words (CBOW) model (`model=cbow`). [If you used `skipgram`, specify that instead and briefly explain why, e.g., "Skip-gram model (`model=skipgram`), often better for capturing representations of rare words."]
- **Parameters:** The following parameters were used during training:
  - `dim`: `[Your vector_size]` (Vector dimensionality)
  - `ws`: `[Your window_size]` (Context window size)
  - `minCount`: `[Your min_word_count]` (Minimum word frequency to be included in vocabulary)
  - `epoch`: `[Your epochs]` (Number of training epochs)
  - `neg`: `[Your negative_samples]` (Number of negative samples)
  - `minn`: `[Your minn]` (Minimum character n-gram length)
  - `maxn`: `[Your maxn]` (Maximum character n-gram length)
  - `thread`: `4` (Number of threads used)
  - [List any other significant parameters you modified]

- **Training Environment:** The training was performed in a Google Colab environment.

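For orientation, the parameters listed above map onto keyword arguments of `fasttext.train_unsupervised`. The sketch below is not the original training script: every numeric value except `thread=4` is an illustrative placeholder, because the card leaves the real values unspecified, and the helper names are hypothetical.

```python
def build_training_args(corpus_path: str) -> dict:
    """Collect keyword arguments for fastText's unsupervised trainer.

    Values are illustrative placeholders, NOT the settings used for this
    model (only thread=4 is stated in the card).
    """
    return {
        "input": corpus_path,
        "model": "cbow",  # the card names CBOW; "skipgram" is the alternative
        "dim": 100,       # vector dimensionality (placeholder)
        "ws": 5,          # context window size (placeholder)
        "minCount": 5,    # minimum word frequency (placeholder)
        "epoch": 5,       # number of training epochs (placeholder)
        "neg": 5,         # number of negative samples (placeholder)
        "minn": 3,        # minimum character n-gram length (placeholder)
        "maxn": 6,        # maximum character n-gram length (placeholder)
        "thread": 4,      # number of threads, as stated in the card
    }


def train(corpus_path: str, output_path: str) -> None:
    """Train and save a model; requires `pip install fasttext`."""
    import fasttext  # imported lazily so the helper above works without it

    model = fasttext.train_unsupervised(**build_training_args(corpus_path))
    model.save_model(output_path)


# Example: train("train.txt", "urdu_fasttext.bin")
```
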
## How to Use

You can load and use this model with the fastText Python library.

First, make sure you have fastText installed:

```bash
pip install fasttext
```

Then load the model and query it:

```python
import fasttext
import numpy as np  # For calculating cosine similarity

# Path to the downloaded .bin model file
model_path = "path/to/your/downloaded/urdu_fasttext.bin"

# Load the fastText model
try:
    model = fasttext.load_model(model_path)
    print("Model loaded successfully!")
except ValueError as e:
    print(f"Error loading model: {e}")
    print("Ensure the file exists and is a valid fastText binary model.")
    model = None  # Set model to None if loading fails

if model is not None:
    # --- Get Word Vector ---
    word = "پاکستان"  # Example Urdu word
    print(f"\nVector for '{word}':")
    try:
        vector = model.get_word_vector(word)
        print(f"Shape: {vector.shape}")
        print(f"First 10 dimensions: {vector[:10]}")
    except ValueError as e:
        print(f"Error getting vector for '{word}': {e}. Word might be too short or have no valid subwords.")

    # --- Find Nearest Neighbors (Similar Words) ---
    word_for_neighbors = "اردو"  # Example Urdu word
    print(f"\nWords similar to '{word_for_neighbors}':")
    try:
        # Get top 10 most similar words as (score, word) pairs
        neighbors = model.get_nearest_neighbors(word_for_neighbors, k=10)
        if neighbors:
            print(neighbors)
        else:
            print(f"No similar words found for '{word_for_neighbors}'.")
    except ValueError as e:
        print(f"Error finding similar words for '{word_for_neighbors}': {e}. Word might not be valid.")

    # --- Calculate Similarity Between Two Words (Manual Cosine Similarity) ---
    word1 = "علم"  # Example word 1
    word2 = "روشنی"  # Example word 2
    print(f"\nSimilarity between '{word1}' and '{word2}':")
    try:
        vec1 = model.get_word_vector(word1)
        vec2 = model.get_word_vector(word2)

        # Calculate cosine similarity
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)

        if norm1 > 0 and norm2 > 0:
            cosine_similarity = np.dot(vec1, vec2) / (norm1 * norm2)
            print(f"Cosine similarity: {cosine_similarity}")
        else:
            print("Cannot compute similarity: zero vector detected for one or both words.")
    except ValueError as e:
        print(f"Error calculating similarity between '{word1}' and '{word2}': {e}. One or both words might not be valid.")

    # --- Using the .vec file (Optional) ---
    # The .vec file contains just the word vectors for words in the vocabulary.
    # It can be loaded by other libraries like Gensim or spaCy.
    # Note: This method *does not* utilize fastText's subword capabilities for OOV words.
    # For fastText specific features, use the .bin file.
    # Example (using gensim - requires gensim installation):
    # from gensim.models import KeyedVectors
    # vec_file_path = "path/to/your/downloaded/urdu_fasttext.vec"
    # try:
    #     # Load vectors in Word2Vec text format
    #     word_vectors = KeyedVectors.load_word2vec_format(vec_file_path, binary=False)
    #     print(f"\nLoaded {len(word_vectors.key_to_index)} vectors from .vec file using Gensim.")
    #     # Example: Find similar words using Gensim
    #     # print(word_vectors.most_similar("اردو"))
    # except Exception as e:
    #     print(f"Error loading .vec file with Gensim: {e}")

else:
    print("\nModel could not be loaded. Usage examples are skipped.")
```

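The `.vec` file mentioned in the comments above is plain text in the word2vec text format: the first line gives the vocabulary size and the vector dimension, and each following line holds a word followed by its vector components. The sketch below parses that layout without any third-party library. `parse_vec` is a hypothetical helper and the toy data is made up.

```python
import io


def parse_vec(stream):
    """Parse word2vec-style text vectors from a text stream into a dict."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        # Slice to `dim` values so a trailing space on the line is ignored.
        vectors[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    if len(vectors) != vocab_size:
        raise ValueError(f"expected {vocab_size} vectors, found {len(vectors)}")
    return vectors


toy = io.StringIO("2 3\nاردو 0.1 0.2 0.3\nزبان 0.4 0.5 0.6\n")
print(parse_vec(toy)["اردو"])
```

For a real file, one would open it with `encoding="utf-8"` and pass the file object to `parse_vec`. As the comments above point out, vectors loaded this way cover only in-vocabulary words.
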
**Steps after creating the Model Card content:**

1. **Create a Model Repository on Hugging Face:** Go to huggingface.co, log in, click your profile picture -> "New model".
2. **Name your Model:** Choose a descriptive name (e.g., `urdu-fasttext-word-embeddings`).
3. **Set Visibility:** Choose Public or Private.
4. **Create Model:** This creates an empty repository.
5. **Upload Files:** Go to the "Files" tab of your new repository. You can either:
   * Click "Add file" and upload `urdu_fasttext.bin`, `urdu_fasttext.vec`, and your training script file.
   * Or, clone the repository locally and push the files using Git.
6. **Edit Model Card:** Go to the "Model card" tab. This is where you paste and format the content prepared above. You can edit it directly in the browser using Markdown.
7. **Fill in Placeholders:** Go through the content and replace all `[ ... ]` placeholders with your specific details (vector size, epochs, dataset source, license, your name, etc.).
8. **Format with Markdown:** Use the formatting options (headers, bold, code blocks) to make the card readable.
9. **Save Model Card:** Save the changes.

Your model will then be available on Hugging Face with the documentation you've provided.
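The repository creation and upload steps above can also be scripted. The sketch below assumes the `huggingface_hub` package is installed and that you are already authenticated (for example via `huggingface-cli login`). The helper names and the repository id are hypothetical.

```python
from pathlib import Path


def plan_uploads(local_dir, filenames):
    """Pair each local file with its destination path in the repository root."""
    return [(str(Path(local_dir) / name), name) for name in filenames]


def upload_model(repo_id, local_dir, filenames, private=False):
    """Create the model repository if needed, then upload each file."""
    from huggingface_hub import HfApi  # lazy import; needs `pip install huggingface_hub`

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", private=private, exist_ok=True)
    for local_path, path_in_repo in plan_uploads(local_dir, filenames):
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="model",
        )


# Example (hypothetical repository id):
# upload_model("your-username/urdu-fasttext-word-embeddings", ".",
#              ["urdu_fasttext.bin", "urdu_fasttext.vec", "README.md"])
```
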